TTDS Lecture 2
TECHNIQUES FOR
DATA SCIENCE
LECTURE 2
Data, Pre-processing and Post-processing
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents as term-frequency vectors

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia
Spatial data: maps
Image data
Video data
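As a minimal sketch of how document data becomes a record, the snippet below counts how often each vocabulary term occurs in a text. The vocabulary, the sample sentence, and the function name `term_frequency_vector` are made up for illustration.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`."""
    # Naive tokenization: lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The coach told the team to play the ball and the team did score"

vector = term_frequency_vector(document, vocabulary)
print(vector)  # [2, 1, 1, 1, 1]
```

Each document then occupies one row of a data matrix whose columns are the vocabulary terms.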
Data Objects
Data → Preprocessing → Data Mining → Post-processing → Result
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
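Normalization, listed above under data transformation, can be sketched with two common variants: min-max scaling and z-score scaling. The slides do not say which variant is meant, so both are shown here as assumptions. The function names and the sample values are illustrative.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values.
incomes = [20, 30, 40, 50, 60]
scaled = min_max(incomes)       # 0.0, 0.25, 0.5, 0.75, 1.0
standardized = z_score(incomes)  # symmetric around 0
```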
Data Cleaning
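The cleaning steps named earlier (fill in missing values, smooth noisy data, identify outliers) can be sketched as follows. This is one possible set of choices, not the only one: mean imputation, smoothing by bin means, and the 1.5 × IQR rule are assumptions, and the price values are illustrative.

```python
from statistics import mean, quantiles


def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort, split into bins of `bin_size`, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Illustrative data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

filled = fill_missing([4, None, 8])          # None becomes 6
smoothed = smooth_by_bin_means(prices, 3)    # bin means 9, 22, 29
outliers = flag_outliers(prices + [200])     # only 200 is flagged
```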
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real-world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real-world entity, attribute values from
different sources are different (e.g., J. D. Smith and John
Smith may refer to the same person)
possible reasons: different representations, different
scales, e.g., metric vs. British units (cm vs. inches)
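A minimal sketch of these integration steps, assuming two hypothetical sources: source A keys customers by `cust-id` and stores height in centimetres, source B keys them by `cust-#` and stores height in inches. All field names other than the two keys, and all record values, are invented for illustration.

```python
CM_PER_INCH = 2.54

# Hypothetical records from two sources.
source_a = [{"cust-id": 1, "name": "J. D. Smith", "height_cm": 180.0}]
source_b = [
    {"cust-#": 1, "name": "John Smith", "height_in": 70.0},
    {"cust-#": 2, "name": "A. Jones", "height_in": 65.0},
]


def normalize_a(rec):
    """Schema integration: map source A's fields onto the common schema."""
    return {"cust_id": rec["cust-id"], "name": rec["name"],
            "height_cm": rec["height_cm"]}


def normalize_b(rec):
    """Map source B's fields onto the common schema and convert inches to cm."""
    return {"cust_id": rec["cust-#"], "name": rec["name"],
            "height_cm": round(rec["height_in"] * CM_PER_INCH, 1)}


def integrate(records_a, records_b):
    """Merge on the shared key and report attribute values that disagree."""
    merged = {rec["cust_id"]: rec for rec in map(normalize_a, records_a)}
    conflicts = []
    for rec in map(normalize_b, records_b):
        existing = merged.get(rec["cust_id"])
        if existing is None:
            merged[rec["cust_id"]] = rec
            continue
        for field in ("name", "height_cm"):
            if existing[field] != rec[field]:
                conflicts.append((rec["cust_id"], field,
                                  existing[field], rec[field]))
    return merged, conflicts


merged, conflicts = integrate(source_a, source_b)
for conflict in conflicts:
    print(conflict)
```

The sketch only detects conflicts. Resolving them (for example deciding that "J. D. Smith" and "John Smith" are the same person) needs a policy that the slides leave open.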
Handling Redundant Data in Data Integration
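Correlation (viewed as linear relationship)
Correlation coefficient:
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
Figure: scatter plots showing the similarity from –1 to 1.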
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
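The slide does not say what to compute for these values. One natural reading is to evaluate the correlation coefficient r for the two stocks, using the sum of products a_i·b_i, the means of A and B, and their sample standard deviations (dividing by n − 1). The function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(a, b):
    """Correlation coefficient: (sum(a_i * b_i) - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1), matching the (n - 1) factor.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / ((n - 1) * sd_a * sd_b)


stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

r = pearson_r(stock_a, stock_b)
print(round(r, 4))  # 0.9407
```

Here the mean of A is 4, the mean of B is 9.6, and the numerator is 212 − 5·4·9.6 = 20, so r = 20 / √(10 · 45.2) ≈ 0.94: the two stocks are strongly positively correlated.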
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Log-linear models: obtain value at a point in m-D space
as the product on appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
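The contrast between the two families can be sketched as follows. The parametric part uses a least-squares line as a stand-in model (the slides mention log-linear models; the line is a swapped-in, simpler example). The non-parametric part shows an equal-width histogram and a simple random sample. All data values and function names are illustrative.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def histogram(values, bin_width):
    """Equal-width histogram: map each bucket's start to its count."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Parametric: ten points are replaced by two stored parameters.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)  # slope 2, intercept 1

# Non-parametric: no model is assumed, only a compact summary is kept.
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
buckets = histogram(prices, 10)                # {0: 6, 10: 5, 20: 4, 30: 1}
sample = random.Random(0).sample(prices, 4)    # simple random sample without replacement
```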
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Numeric — integer or real numbers
Discretization:
divide the range of a continuous attribute into intervals
why?
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
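The two binning schemes can be sketched as below: equal-width bins split the value range into intervals of the same length, while equal-depth (equal-frequency) bins hold roughly the same number of values each. The function names and the sample ages are assumptions made for illustration.

```python
def equal_width_bins(values, k):
    """Return k (low, high) intervals of identical width covering the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]


def equal_depth_bins(values, k):
    """Split the sorted values into k bins with (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical ages.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28, 30, 35, 80]

print(equal_width_bins(ages, 4))
# [(0.0, 20.0), (20.0, 40.0), (40.0, 60.0), (60.0, 80.0)]
print(equal_depth_bins(ages, 4))
# [[0, 4, 12], [16, 16, 18], [24, 26, 28], [30, 35, 80]]
```

With skewed data such as this, equal-width binning leaves the 40-60 interval empty, whereas equal-depth binning adapts the interval boundaries to the data.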
Discretization by Classification & Correlation Analysis
Classification (e.g., decision tree analysis)
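A hedged sketch of discretization by decision tree analysis: with class labels available, a cut point for a numeric attribute can be chosen so that the two resulting intervals are as pure as possible. The slide does not name a purity criterion; entropy is assumed here as one common choice. The data and function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return the cut point that minimizes the weighted entropy of both sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Hypothetical attribute (age) and class label (buys a product).
ages = [22, 25, 30, 35, 40, 52, 60, 65]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

cut, score = best_split(ages, buys)
print(cut)  # 32.5
```

Applying `best_split` recursively to each side would yield further intervals, which is how a top-down tree-based discretization proceeds.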
Visualization
The human eye is a powerful analytical tool
If we visualize the data properly, we can discover patterns
Visualization is the way to present the data so that patterns can be seen
E.g., histograms and plots are a form of visualization
There are multiple techniques (a field of its own)
Scatter Plot Array of Iris Attributes
Contour Plot Example: SST (sea surface temperature), Dec 1998, in Celsius