
TOOLS & TECHNIQUES FOR DATA SCIENCE
LECTURE 2
Data, Pre-processing and Post-processing
Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0       2
Document 2    0     7     0     2      1     0     0     3      0       0
Document 3    0     1     0     0      1     2     2     0      3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Data Collection

Data Collection → Data Preprocessing → Data Mining → Post-processing → Result
 Today there is an abundance of data online


 Facebook, Twitter, Wikipedia, Web, etc…
 We can extract interesting information from this data, but first we need to collect it
 Customized crawlers, use of public APIs (a minimal sketch follows below)
 Additional cleaning/processing to parse out the useful parts
 Respect crawling etiquette
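Below is a minimal, illustrative sketch of collecting data through a public API with basic crawling etiquette. It assumes the Python requests library and uses Wikipedia's public REST summary endpoint; the endpoint, the fields parsed out, and the one-second delay are example choices, not part of the lecture.

```python
# Sketch: collect data from a public API, parse out the useful parts,
# and respect crawling etiquette (identify yourself, throttle requests).
import time
import requests

HEADERS = {"User-Agent": "ttds-lecture-demo/0.1 (educational use)"}

def fetch_summary(title):
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Cleaning/processing step: keep only the fields we actually need
    return {"title": data.get("title"), "extract": data.get("extract")}

for page in ["Data_science", "Data_mining"]:
    print(fetch_summary(page))
    time.sleep(1)  # crawling etiquette: rate-limit our requests
```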
Data Quality

 Examples of data quality problems:
   Noise and outliers
   Missing values
   Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- a mistake or a millionaire?
6    No      NULL            60K             No     <- missing value
7    Yes     Divorced        220K            NULL   <- missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No     <- inconsistent duplicate entries
Data Quality

 Data in the real world is dirty


 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 noisy: containing errors or outliers
 inconsistent: containing discrepancies in codes or
names
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 Data warehouse needs consistent integration of quality
data
 Required for both OLAP and Data Mining!
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling references, …
 Timeliness: timely update?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
Why can Data be Incomplete?

 Attributes of interest are not available (e.g.,


customer information for sales transaction data)
 Data were not considered important at the time
of transactions, so they were not recorded!
 Data not recorded because of misunderstanding
or malfunctions
 Data may have been recorded and later deleted!
 Missing/unknown values for some data
Why can Data be Noisy/Inconsistent?

 Faulty instruments for data collection


 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come at
a faster rate than they can be processed)
 Inconsistencies in naming conventions or data
codes (e.g., 2/5/2002 could be 2 May 2002 or 5
Feb 2002)
 Duplicate tuples, e.g., records that were received twice, should also be removed
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Data Cleaning

 Data cleaning tasks


 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data


How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”,
a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, e.g., using regression, a Bayesian formula, or decision-tree induction
How to Handle Missing Data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates based on the global value distribution:
E.g., put the average income here, or the most probable income given that the person is 39 years old
E.g., put the most frequent religion here
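As a rough sketch of the fill-in strategies above, the snippet below imputes the numeric attribute with its mean and the nominal attribute with its most frequent value, assuming pandas; the column names simply mirror the toy table.

```python
# Minimal sketch of mean / most-frequent imputation with pandas.
import pandas as pd

df = pd.DataFrame({
    "Age":      [23, 39, 45],
    "Income":   [24200, None, 45390],
    "Religion": ["Muslim", "Christian", None],
    "Gender":   ["M", "F", "F"],
})

# Numeric attribute: fill with the attribute mean (an aggregate function)
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Nominal attribute: fill with the most frequent value (the mode)
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

print(df)
```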
Inconsistent Data

 Inconsistent data are handled by:


 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct
them. E.g., the routine may check global constraints (e.g., age > 10) or
functional dependencies

 Other inconsistencies (e.g., between names of the same attribute) can be


corrected during the data integration process
Data Integration

 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from
different sources are different (e.g., J. D. Smith and John
Smith may refer to the same person)
 possible reasons: different representations, different
scales, e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration

 Redundant data often occur when integrating multiple databases
 The same attribute may have different names in different
databases
 One attribute may be a “derived” attribute in another table,
e.g., annual revenue
 Redundant data may be detected by correlation analysis and
covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
 The larger the Χ2 value, the more likely the
variables are related
 The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

 It shows that like_science_fiction and play_chess are correlated in the group
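A small sketch that reproduces this calculation with NumPy; the expected counts are derived from the row and column totals, exactly as in the table.

```python
# Chi-square statistic for the 2x2 contingency table above.
import numpy as np

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # not like science fiction

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()   # e.g., 450 * 300 / 1500 = 90

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # 507.93
```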
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-product.

 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
Visually Evaluating Correlation

[Figure: scatter plots showing correlation values ranging from –1 to 1.]
Correlation (viewed as linear
relationship)

 Correlation measures the linear relationship between objects


 To compute correlation, we standardize data objects, A and B, and
then take their dot product

a 'k (ak  mean( A)) / std ( A)

b'k (bk  mean( B )) / std ( B )

correlatio n( A, B )  A' B '
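A minimal sketch of this standardize-then-dot-product view, assuming NumPy. Note that with the population standard deviation the dot product is divided by n so that the result matches the usual Pearson coefficient (and np.corrcoef); the sample values are made up for illustration.

```python
# Pearson correlation via standardization and a dot product.
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each data object (population std, ddof=0)
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()

# Dot product of the standardized objects, divided by n
r = np.dot(A_std, B_std) / len(A)

print(round(r, 3))                         # 0.941
print(round(np.corrcoef(A, B)[0, 1], 3))   # 0.941 (same value)
```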


Covariance (Numeric Data)
 Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$$

 Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Co-Variance: An Example

 It can be simplified in computation as

$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$

 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends,


will their prices rise or fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
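The worked example can be checked with a few lines of NumPy, using both the simplified form E(A·B) − ĀB̄ and np.cov with bias=True for the population (divide-by-n) definition used above.

```python
# Verifying Cov(A, B) = 4 for the two stocks.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_simplified = (A * B).mean() - A.mean() * B.mean()   # E(A*B) - E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # population covariance

print(cov_simplified, cov_numpy)   # both 4.0 -> the stocks rise together
```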


Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Reduction Strategies

 Warehouse may store terabytes of data: Complex


data analysis/mining may take a very long time to
run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Normalization
 Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

 Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

 Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

 Normalization by decimal scaling:

$$v' = \frac{v}{10^{j}}$$  where j is the smallest integer such that Max(|v'|) < 1
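A minimal sketch of the three schemes in Python, reproducing the income example above; the helper names and the rule used to choose j for decimal scaling are illustrative.

```python
# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # smallest j such that max(|v'|) < 1 after dividing by 10**j
    j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
    return values / 10 ** j

print(min_max(73_600, 12_000, 98_000))                   # ~0.716
print(z_score(73_600, 54_000, 16_000))                   # 1.225
print(decimal_scaling(np.array([73_600.0, 12_000.0])))   # [0.736 0.12]
```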
Data Cube Aggregation

 The lowest level of a data cube


 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve
the task
 Queries regarding aggregated information should be
answered using data cube, when possible
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):


 Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
 reduce the number of attributes in the patterns, making them easier to understand
 Heuristic methods (due to the exponential number of choices; a sketch of forward selection follows below):
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
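As a hedged illustration of step-wise forward selection (one of the heuristic methods above), the sketch below uses scikit-learn's SequentialFeatureSelector, assuming scikit-learn ≥ 0.24 and its bundled iris data; the estimator and the number of features kept are arbitrary choices.

```python
# Step-wise forward feature selection with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy forward selection: repeatedly add the attribute that improves
# cross-validated accuracy the most, until 2 attributes remain.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",   # direction="backward" gives step-wise elimination
)
selector.fit(X, y)

print(selector.get_support())        # boolean mask over the 4 iris attributes
print(selector.transform(X).shape)   # (150, 2): the reduced data matrix
```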
Numerosity Reduction: Reduce the volume of data

 Parametric methods
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
 Log-linear models: obtain the value at a point in m-D space
as the product over appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
Discretization
 Three types of attributes:
 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:
 divide the range of a continuous attribute into intervals

 why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

 Supervised vs. unsupervised

 Split (top-down) vs. merge (bottom-up)

 Prepare for further analysis, e.g., classification


How to Handle Noisy Data?
Smoothing techniques
 Binning method
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, bin medians, or bin boundaries, etc. (see the sketch after this list)
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 computer detects suspicious values, which are then checked
by humans
 Regression
 smooth by fitting the data into regression functions
 Use Concept hierarchies
 use concept hierarchies, e.g., price value -> “expensive”
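A minimal sketch of smoothing by (equi-depth) bin means, assuming NumPy; the price values are made-up illustrative data.

```python
# Smooth noisy values by bin means: sort, split into equi-depth bins,
# then replace each value by the mean of its bin.
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(prices, 3)   # 3 equi-depth bins of 3 values each
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```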
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size:
uniform grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling – good handling of skewed data
Simple Discretization Methods:
Binning
Example: customer ages (histogram of the number of values per age)

Equi-width binning:  0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning:  0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
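A rough pandas sketch of the two partitioning schemes: pd.cut produces equal-width bins and pd.qcut produces equal-depth (equal-frequency) bins; the ages and the choice of four bins are illustrative.

```python
# Equal-width vs. equal-depth binning with pandas.
import pandas as pd

ages = pd.Series([13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
                  25, 25, 25, 25, 30, 33, 33, 35, 35, 36,
                  40, 45, 46, 52, 70])

equi_width = pd.cut(ages, bins=4)   # 4 intervals of equal size
equi_depth = pd.qcut(ages, q=4)     # 4 intervals with ~equal counts

print(equi_width.value_counts().sort_index())
print(equi_depth.value_counts().sort_index())
```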
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)

 Supervised: Given class labels, e.g., cancerous vs. benign

 Using entropy to determine split point (discretization point)

 Top-down, recursive split

 Correlation analysis (e.g., Chi-merge: χ2-based discretization)

 Supervised: use class information

 Bottom-up merge: find the best neighboring intervals (those


having similar distributions of classes, i.e., low χ2 values) to
merge

 Merge performed recursively, until a predefined stopping


condition
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in
a data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult,
or senior)
 Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
 Concept hierarchy can be automatically formed for both
numeric and nominal data. For numeric data, use discretization
methods such as binning (shown earlier)
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels)
by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state,
country}
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country              15 distinct values
province_or_state    365 distinct values
city                 3,567 distinct values
street               674,339 distinct values
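A small sketch of this idea, assuming pandas: count the distinct values per attribute and sort ascending, so the attribute with the fewest distinct values ends up at the top of the hierarchy. The column names and rows are hypothetical.

```python
# Order attributes into a hierarchy by number of distinct values.
import pandas as pd

df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "state":   ["BC", "Ontario", "Illinois", "Illinois"],
    "city":    ["Vancouver", "Toronto", "Chicago", "Urbana"],
    "street":  ["Main St", "King St", "State St", "Green St"],
})

hierarchy = df.nunique().sort_values()   # fewest distinct values first
print(hierarchy)                         # country < state < city < street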


Post-processing

 Visualization
 The human eye is a powerful analytical tool
 If we visualize the data properly, we can discover patterns
 Visualization is the way to present the data so that patterns can be seen
 E.g., histograms and plots are a form of visualization
 There are multiple techniques (a field on its own)
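A minimal visualization sketch, assuming matplotlib and scikit-learn's bundled iris data: a histogram of one attribute and a scatter plot of two others colored by class, which already hints at the cluster structure.

```python
# Simple histogram and scatter plot as basic visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data   # columns: sepal/petal length and width

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(X[:, 0], bins=20)          # distribution of sepal length
ax1.set_xlabel("sepal length (cm)")

ax2.scatter(X[:, 2], X[:, 3], c=iris.target)   # species form visible clusters
ax2.set_xlabel("petal length (cm)")
ax2.set_ylabel("petal width (cm)")

plt.tight_layout()
plt.show()
```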
[Figure: scatter plot array of the Iris attributes.]

[Figure: contour plot example: SST (sea surface temperature), Dec. 1998, in degrees Celsius.]