
TOOLS & TECHNIQUES FOR DATA SCIENCE
LECTURE 2
Data, Pre-processing and Post-processing
Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0       2
Document 2    0     7     0     2      1     0     0     3      0       0
Document 3    0     1     0     0      1     2     2     0      3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Data Collection

Data Collection → Data Preprocessing → Data Mining → Post-processing → Result
 Today there is an abundance of data online


 Facebook, Twitter, Wikipedia, Web, etc…
 We can extract interesting information from this data, but first we need to collect it
 Customized crawlers, use of public APIs (a minimal sketch follows below)
 Additional cleaning/processing to parse out the useful parts
 Respect crawling etiquette
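Below is a minimal, illustrative sketch of collecting data through a public API with basic crawling etiquette. It assumes the Python requests library and uses Wikipedia's public REST summary endpoint; the endpoint, the fields parsed out, and the one-second delay are example choices, not part of the lecture.

```python
# Sketch: collect data from a public API, parse out the useful parts,
# and respect crawling etiquette (identify yourself, throttle requests).
import time
import requests

HEADERS = {"User-Agent": "ttds-lecture-demo/0.1 (educational use)"}

def fetch_summary(title):
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Cleaning/processing step: keep only the fields we actually need
    return {"title": data.get("title"), "extract": data.get("extract")}

for page in ["Data_science", "Data_mining"]:
    print(fetch_summary(page))
    time.sleep(1)  # crawling etiquette: rate-limit our requests
```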
Data Quality

 Examples of data quality problems:
   Noise and outliers
   Missing values
   Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- a mistake or a millionaire?
6    No      NULL            60K             No     <- missing value
7    Yes     Divorced        220K            NULL   <- missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No     <- inconsistent duplicate entries
Data Quality

 Data in the real world is dirty


 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 noisy: containing errors or outliers
 inconsistent: containing discrepancies in codes or
names
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 Data warehouse needs consistent integration of quality
data
 Required for both OLAP and Data Mining!
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling references, …
 Timeliness: timely update?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
Why can Data be Incomplete?

 Attributes of interest are not available (e.g.,


customer information for sales transaction data)
 Data were not considered important at the time
of transactions, so they were not recorded!
 Data not recorded because of misunderstanding
or malfunctions
 Data may have been recorded and later deleted!
 Missing/unknown values for some data
Why can Data be Noisy/Inconsistent?

 Faulty instruments for data collection


 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come at
a faster rate than they can be processed)
 Inconsistencies in naming conventions or data
codes (e.g., 2/5/2002 could be 2 May 2002 or 5
Feb 2002)
 Duplicate tuples, e.g., records that were received twice, should also be removed
Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Data Cleaning

 Data cleaning tasks


 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data


How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”,
a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, e.g., using regression, a Bayesian formula, or decision-tree induction
How to Handle Missing Data?

Age  Income  Religion   Gender
23   24,200  Muslim     M
39   ?       Christian  F
45   45,390  ?          F

Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates based on the global value distribution:
E.g., put the average income here, or the most probable income given that the person is 39 years old
E.g., put the most frequent religion here
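As a rough sketch of the fill-in strategies above, the snippet below imputes the numeric attribute with its mean and the nominal attribute with its most frequent value, assuming pandas; the column names simply mirror the toy table.

```python
# Minimal sketch of mean / most-frequent imputation with pandas.
import pandas as pd

df = pd.DataFrame({
    "Age":      [23, 39, 45],
    "Income":   [24200, None, 45390],
    "Religion": ["Muslim", "Christian", None],
    "Gender":   ["M", "F", "F"],
})

# Numeric attribute: fill with the attribute mean (an aggregate function)
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Nominal attribute: fill with the most frequent value (the mode)
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

print(df)
```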
Inconsistent Data

 Inconsistent data are handled by:


 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct
them. E.g., the routine may check global constraints (e.g., age > 10) or
functional dependencies

 Other inconsistencies (e.g., between names of the same attribute) can be


corrected during the data integration process
Data Integration

 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from
different sources are different (e.g., J. D. Smith and John
Smith may refer to the same person)
 possible reasons: different representations, different
scales, e.g., metric vs. British units (inches vs. cm)
Handling Redundant Data in Data Integration

 Redundant data often occur when integrating multiple databases
 The same attribute may have different names in different
databases
 One attribute may be a “derived” attribute in another table,
e.g., annual revenue
 Redundant data may be detected by correlation analysis and
covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
 The larger the Χ2 value, the more likely the
variables are related
 The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

 It shows that like_science_fiction and play_chess are correlated in the group
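A small sketch that reproduces this calculation with NumPy; the expected counts are derived from the row and column totals, exactly as in the table.

```python
# Chi-square statistic for the 2x2 contingency table above.
import numpy as np

observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # not like science fiction

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()   # e.g., 450 * 300 / 1500 = 90

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # 507.93
```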
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-product.

 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
Visually Evaluating Correlation

[Figure: scatter plots showing correlation values ranging from –1 to 1.]
Correlation (viewed as linear
relationship)

 Correlation measures the linear relationship between objects


 To compute correlation, we standardize data objects, A and B, and
then take their dot product

a 'k (ak  mean( A)) / std ( A)

b'k (bk  mean( B )) / std ( B )

correlatio n( A, B )  A' B '
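A minimal sketch of this standardize-then-dot-product view, assuming NumPy. Note that with the population standard deviation the dot product is divided by n so that the result matches the usual Pearson coefficient (and np.corrcoef); the sample values are made up for illustration.

```python
# Pearson correlation via standardization and a dot product.
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each data object (population std, ddof=0)
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()

# Dot product of the standardized objects, divided by n
r = np.dot(A_std, B_std) / len(A)

print(round(r, 3))                         # 0.941
print(round(np.corrcoef(A, B)[0, 1], 3))   # 0.941 (same value)
```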


Covariance (Numeric Data)
 Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$$

 Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Co-Variance: An Example

 It can be simplified in computation as

$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$

 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends,


will their prices rise or fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
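The worked example can be checked with a few lines of NumPy, using both the simplified form E(A·B) − ĀB̄ and np.cov with bias=True for the population (divide-by-n) definition used above.

```python
# Verifying Cov(A, B) = 4 for the two stocks.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_simplified = (A * B).mean() - A.mean() * B.mean()   # E(A*B) - E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # population covariance

print(cov_simplified, cov_numpy)   # both 4.0 -> the stocks rise together
```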


Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Reduction Strategies

 Warehouse may store terabytes of data: Complex


data analysis/mining may take a very long time to
run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Normalization
 Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

 Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

 Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

 Normalization by decimal scaling:

$$v' = \frac{v}{10^{j}}$$  where j is the smallest integer such that Max(|v'|) < 1
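A minimal sketch of the three schemes in Python, reproducing the income example above; the helper names and the rule used to choose j for decimal scaling are illustrative.

```python
# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # smallest j such that max(|v'|) < 1 after dividing by 10**j
    j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
    return values / 10 ** j

print(min_max(73_600, 12_000, 98_000))                   # ~0.716
print(z_score(73_600, 54_000, 16_000))                   # 1.225
print(decimal_scaling(np.array([73_600.0, 12_000.0])))   # [0.736 0.12]
```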
Data Cube Aggregation

 The lowest level of a data cube


 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve
the task
 Queries regarding aggregated information should be
answered using data cube, when possible
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):


 Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
 reduce the number of attributes in the patterns, making them easier to understand
 Heuristic methods (due to the exponential number of choices; a sketch of forward selection follows below):
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
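As a hedged illustration of step-wise forward selection (one of the heuristic methods above), the sketch below uses scikit-learn's SequentialFeatureSelector, assuming scikit-learn ≥ 0.24 and its bundled iris data; the estimator and the number of features kept are arbitrary choices.

```python
# Step-wise forward feature selection with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy forward selection: repeatedly add the attribute that improves
# cross-validated accuracy the most, until 2 attributes remain.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",   # direction="backward" gives step-wise elimination
)
selector.fit(X, y)

print(selector.get_support())        # boolean mask over the 4 iris attributes
print(selector.transform(X).shape)   # (150, 2): the reduced data matrix
```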
Numerosity Reduction: Reduce the volume of data

 Parametric methods
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
 Log-linear models: obtain the value at a point in m-D space
as the product over appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
Discretization
 Three types of attributes:
 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:
 divide the range of a continuous attribute into intervals

 why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

 Supervised vs. unsupervised

 Split (top-down) vs. merge (bottom-up)

 Prepare for further analysis, e.g., classification


How to Handle Noisy Data?
Smoothing techniques
 Binning method
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, bin medians, or bin boundaries, etc. (see the sketch after this list)
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 computer detects suspicious values, which are then checked
by humans
 Regression
 smooth by fitting the data into regression functions
 Use Concept hierarchies
 use concept hierarchies, e.g., price value -> “expensive”
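A minimal sketch of smoothing by (equi-depth) bin means, assuming NumPy; the price values are made-up illustrative data.

```python
# Smooth noisy values by bin means: sort, split into equi-depth bins,
# then replace each value by the mean of its bin.
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(prices, 3)   # 3 equi-depth bins of 3 values each
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```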
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size:
uniform grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling – good handling of skewed data
Simple Discretization Methods:
Binning
Example: customer ages (histogram of the number of values per age)

Equi-width binning:  0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning:  0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
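A rough pandas sketch of the two partitioning schemes: pd.cut produces equal-width bins and pd.qcut produces equal-depth (equal-frequency) bins; the ages and the choice of four bins are illustrative.

```python
# Equal-width vs. equal-depth binning with pandas.
import pandas as pd

ages = pd.Series([13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
                  25, 25, 25, 25, 30, 33, 33, 35, 35, 36,
                  40, 45, 46, 52, 70])

equi_width = pd.cut(ages, bins=4)   # 4 intervals of equal size
equi_depth = pd.qcut(ages, q=4)     # 4 intervals with ~equal counts

print(equi_width.value_counts().sort_index())
print(equi_depth.value_counts().sort_index())
```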
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)

 Supervised: Given class labels, e.g., cancerous vs. benign

 Using entropy to determine split point (discretization point)

 Top-down, recursive split

 Correlation analysis (e.g., Chi-merge: χ2-based discretization)

 Supervised: use class information

 Bottom-up merge: find the best neighboring intervals (those


having similar distributions of classes, i.e., low χ2 values) to
merge

 Merge performed recursively, until a predefined stopping


condition
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in
a data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult,
or senior)
 Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
 Concept hierarchy can be automatically formed for both
numeric and nominal data. For numeric data, use discretization
methods such as binning (shown earlier)
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels)
by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state,
country}
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country              15 distinct values
province_or_state    365 distinct values
city                 3,567 distinct values
street               674,339 distinct values
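A small sketch of this idea, assuming pandas: count the distinct values per attribute and sort ascending, so the attribute with the fewest distinct values ends up at the top of the hierarchy. The column names and rows are hypothetical.

```python
# Order attributes into a hierarchy by number of distinct values.
import pandas as pd

df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "state":   ["BC", "Ontario", "Illinois", "Illinois"],
    "city":    ["Vancouver", "Toronto", "Chicago", "Urbana"],
    "street":  ["Main St", "King St", "State St", "Green St"],
})

hierarchy = df.nunique().sort_values()   # fewest distinct values first
print(hierarchy)                         # country < state < city < street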


Post-processing

 Visualization
 The human eye is a powerful analytical tool
 If we visualize the data properly, we can discover patterns
 Visualization is the way to present the data so that patterns can be seen
 E.g., histograms and plots are a form of visualization
 There are multiple techniques (a field on its own)
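A minimal visualization sketch, assuming matplotlib and scikit-learn's bundled iris data: a histogram of one attribute and a scatter plot of two others colored by class, which already hints at the cluster structure.

```python
# Simple histogram and scatter plot as basic visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data   # columns: sepal/petal length and width

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(X[:, 0], bins=20)          # distribution of sepal length
ax1.set_xlabel("sepal length (cm)")

ax2.scatter(X[:, 2], X[:, 3], c=iris.target)   # species form visible clusters
ax2.set_xlabel("petal length (cm)")
ax2.set_ylabel("petal width (cm)")

plt.tight_layout()
plt.show()
```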
[Figure: scatter plot array of the Iris attributes.]

[Figure: contour plot example: SST (sea surface temperature), Dec. 1998, in degrees Celsius.]