Correlation

The document discusses data preprocessing techniques which include data cleaning, integration, and reduction. It covers handling incomplete, noisy and inconsistent data through methods such as filling in missing values, detecting and removing outliers, and resolving inconsistencies between multiple data sources.

Uploaded by

Muneeba Hussain


Data Mining: Concepts and Techniques

Data Quality: Why Preprocess the Data?

• Measures for data quality: a multidimensional view
    ◦ Accuracy: is each value correct or wrong?
    ◦ Completeness: are values missing (not recorded, unavailable, …)?
    ◦ Consistency: were some copies modified while others were not?
    ◦ Timeliness: is the data updated in time?
    ◦ Believability: how far can the data be trusted to be correct?
    ◦ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
• Data cleaning
    ◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
    ◦ Integration of multiple databases, data cubes, or files
• Data reduction
    ◦ Dimensionality reduction
    ◦ Numerosity reduction
    ◦ Data compression
• Data transformation and data discretization
    ◦ Normalization
    ◦ Concept hierarchy generation

Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
    ◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
        ▪ e.g., Occupation=“ ” (missing data)
    ◦ Noisy: containing noise, errors, or outliers
        ▪ e.g., Salary=“−10” (an error)
    ◦ Inconsistent: containing discrepancies in codes or names, e.g.,
        ▪ Age=“42”, Birthday=“03/07/2010”
        ▪ Was rating “1, 2, 3”, now rating “A, B, C”
        ▪ discrepancy between duplicate records
    ◦ Intentional (e.g., disguised missing data)
        ▪ Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data

• Data is not always available
    ◦ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
    ◦ equipment malfunction
    ◦ values that were inconsistent with other recorded data and were therefore deleted
    ◦ data not entered due to misunderstanding
    ◦ data not considered important at the time of entry
    ◦ history or changes of the data not being registered
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
    ◦ a global constant, e.g., “unknown” (which may end up behaving like a new class)
    ◦ the attribute mean
    ◦ the attribute mean for all samples belonging to the same class (smarter)
    ◦ the most probable value: inference-based, such as a Bayesian formula or decision tree
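
The two mean-based fill strategies can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `customers` records, the `income` and `risk` field names, and both helper functions are made up for the example, and missing values are assumed to be stored as `None`.

```python
from collections import defaultdict
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace None in `attr` with the mean of all observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace None in `attr` with the mean over rows sharing the same class label.

    Assumes every class has at least one observed value for `attr`.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {c: mean(v) for c, v in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]}
            for r in rows]

# Hypothetical sales records; income is missing for two customers.
customers = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
]

global_filled = fill_with_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "risk")
```

With this toy data the global mean fills both gaps with 60, while the class-wise mean fills the "low" gap with 40 and the "high" gap with 100, which is why the slide calls the second option smarter.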

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
    ◦ faulty data collection instruments
    ◦ data entry problems
    ◦ data transmission problems
    ◦ technology limitations
    ◦ inconsistency in naming conventions
• Other data problems that require data cleaning
    ◦ duplicate records
    ◦ incomplete data
    ◦ inconsistent data

How to Handle Noisy Data?
• Binning
    ◦ first sort the data and partition it into (equal-frequency) bins
    ◦ then smooth by bin means, by bin medians, by bin boundaries, etc.
• Regression
    ◦ smooth by fitting the data to regression functions
• Clustering
    ◦ detect and remove outliers
• Combined computer and human inspection
    ◦ detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
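
The binning steps can be illustrated with a short Python sketch. The `prices` list is illustrative data rather than something taken from the slides, the function names are made up, and ties in boundary smoothing are arbitrarily resolved toward the lower boundary.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size.

    Assumes len(values) >= n_bins so that no bin is empty.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Here the three bins are [4, 8, 9, 15], [21, 21, 24, 25], and [26, 28, 29, 34]. Smoothing by means replaces their values with 9, 22.75, and 29.25 respectively, while smoothing by boundaries snaps each value to the nearer end of its bin.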


Data Cleaning as a Process
• Data discrepancy detection
    ◦ Use metadata (e.g., domain, range, dependency, distribution)
    ◦ Check field overloading
    ◦ Check the uniqueness rule, consecutive rule, and null rule
    ◦ Use commercial tools
        ▪ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
        ▪ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
    ◦ Data migration tools: allow transformations to be specified
    ◦ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
    ◦ Iterative and interactive (e.g., Potter’s Wheel)
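
As a rough illustration of discrepancy detection, the sketch below checks the three rules named above on a handful of records. It rests on assumptions: the uniqueness rule is read as "no value may repeat", the consecutive rule as "no gaps between the lowest and highest value", and the null rule as "required fields must not hold a placeholder". The `invoices` data, the field names, and the `NULL_MARKERS` set are all hypothetical.

```python
def check_unique(values):
    """Uniqueness rule: return the set of values that occur more than once."""
    seen, dupes = set(), set()
    for v in values:
        (dupes if v in seen else seen).add(v)
    return dupes

def check_consecutive(values):
    """Consecutive rule: return integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

# Assumed spellings of "no value"; a real data set would document its own.
NULL_MARKERS = {None, "", "?", "N/A"}

def check_null(records, required):
    """Null rule: return (row index, field) pairs where a required field is a null marker."""
    return [(i, f) for i, r in enumerate(records)
            for f in required if r.get(f) in NULL_MARKERS]

# Hypothetical records with one duplicate number, one gap, and two null markers.
invoices = [
    {"invoice_no": 101, "postal_code": "90210"},
    {"invoice_no": 102, "postal_code": "?"},
    {"invoice_no": 104, "postal_code": ""},
    {"invoice_no": 104, "postal_code": "10001"},
]

numbers = [r["invoice_no"] for r in invoices]
duplicates = check_unique(numbers)
gaps = check_consecutive(numbers)
nulls = check_null(invoices, ["postal_code"])
```

On this data the checks report invoice number 104 as duplicated, 103 as missing from the sequence, and rows 1 and 2 as holding null markers in `postal_code`. Flagged records would then go to scrubbing or human inspection rather than being corrected automatically.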


Data Integration
• Data integration
    ◦ Combines data from multiple sources into a coherent store
• Schema integration, e.g., A.cust-id ≡ B.cust-#
    ◦ Integrate metadata from different sources
• Entity identification problem
    ◦ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
    ◦ For the same real-world entity, attribute values from different sources are different
    ◦ Possible reasons: different representations or different scales, e.g., metric vs. British units
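
A small Python sketch can make the schema-mapping and value-conflict steps concrete. Everything in it is hypothetical: the two sources, their field names other than `cust-id` and `cust-#`, the weight attribute, and the 0.5 kg tolerance are invented for illustration. The idea is to map both schemas onto one key, convert units to a common scale, and only then compare values.

```python
LB_PER_KG = 2.20462  # approximate pounds per kilogram

# Hypothetical source A: key "cust-id", weight stored in kilograms.
source_a = [{"cust-id": 1, "name": "William Clinton", "weight_kg": 80.0}]
# Hypothetical source B: key "cust-#", weight stored in pounds.
source_b = [{"cust-#": 1, "name": "Bill Clinton", "weight_lb": 176.4}]

def normalize_a(row):
    """Map source A onto the shared schema (cust_id, name, weight_kg)."""
    return {"cust_id": row["cust-id"], "name": row["name"],
            "weight_kg": row["weight_kg"]}

def normalize_b(row):
    """Map source B onto the shared schema, converting pounds to kilograms."""
    return {"cust_id": row["cust-#"], "name": row["name"],
            "weight_kg": row["weight_lb"] / LB_PER_KG}

def integrate(rows_a, rows_b, tolerance=0.5):
    """Join on cust_id and report attribute values that still disagree."""
    merged, conflicts = {}, []
    for row in map(normalize_a, rows_a):
        merged[row["cust_id"]] = row
    for row in map(normalize_b, rows_b):
        kept = merged.setdefault(row["cust_id"], row)
        if kept is row:
            continue  # entity only present in source B
        if kept["name"] != row["name"]:
            conflicts.append((row["cust_id"], "name", kept["name"], row["name"]))
        if abs(kept["weight_kg"] - row["weight_kg"]) > tolerance:
            conflicts.append((row["cust_id"], "weight_kg",
                              kept["weight_kg"], row["weight_kg"]))
    return merged, conflicts

merged, conflicts = integrate(source_a, source_b)
```

After unit conversion the two weights agree within the tolerance, so the only conflict left is the name pair, which is the entity identification problem and needs domain knowledge (or a nickname table) to resolve.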

Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated
    ◦ Object identification: the same attribute or object may have different names in different databases
    ◦ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
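
For numeric attributes, one way to run the correlation check is sketched below. The slide does not give a formula, so the sketch assumes Pearson's correlation coefficient built on the population covariance; the attribute names and values are made up, with `annual_revenue` deliberately derived from `monthly_revenue`.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical attributes from two integrated tables.
monthly_revenue = [10.0, 12.0, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]   # derived attribute
store_rating = [3.0, 1.0, 2.0, 2.0, 3.0]             # unrelated attribute

r_derived = pearson(monthly_revenue, annual_revenue)
r_unrelated = pearson(monthly_revenue, store_rating)
```

A coefficient at or very near ±1, as for the derived pair here, suggests one attribute can be dropped as redundant. The unrelated pair gives a much weaker value (about −0.31 on this toy data), so both attributes would be kept.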

Correlation Analysis (Nominal Data)
• χ² (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
    ◦ The number of hospitals and the number of car thefts in a city are correlated
    ◦ Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories)
• Expected count = (row total × column total) / grand total, e.g., 450 × 300 / 1500 = 90

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$$

• It shows that like_science_fiction and play_chess are correlated in the group
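
The same calculation can be reproduced with a short Python sketch that derives the expected counts from the row and column totals and then sums the per-cell contributions. The function names are illustrative; the observed counts are the ones in the table above.

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum (observed - expected)^2 / expected over all cells of the table."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Rows: like / not like science fiction. Columns: play / not play chess.
observed = [[250, 200],
            [50, 1000]]

expected = expected_counts(observed)
chi2 = chi_square(observed)
```

The expected counts come out as 90, 360, 210, and 840, matching the parenthesized numbers in the table, and the statistic is about 507.94, which agrees with the slide's total up to rounding in the last digit.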

04/20/24 14

VIPDMTheory Chapter 3
No ratings yet
VIPDMTheory Chapter 3
87 pages
Data Preprocessing
No ratings yet
Data Preprocessing
48 pages
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
No ratings yet
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
40 pages
Mod2 DM
No ratings yet
Mod2 DM
86 pages
Module2 DataPreprocessing
No ratings yet
Module2 DataPreprocessing
27 pages
03 Preprocessing
No ratings yet
03 Preprocessing
18 pages
Data Mining and Preprocessing Guide
No ratings yet
Data Mining and Preprocessing Guide
40 pages
2-Data Fundamentals For BI - Part1
No ratings yet
2-Data Fundamentals For BI - Part1
39 pages
Data Preprocessing Essentials
No ratings yet
Data Preprocessing Essentials
33 pages
Unit2 Part2
No ratings yet
Unit2 Part2
67 pages
Pre Processing
No ratings yet
Pre Processing
52 pages
DWDM LS3 Fall 24 25
No ratings yet
DWDM LS3 Fall 24 25
50 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
30 pages
CH 3
No ratings yet
CH 3
68 pages
Pre Processing
No ratings yet
Pre Processing
68 pages
04 DM BI Data Preprocessing
No ratings yet
04 DM BI Data Preprocessing
93 pages
DMDW Unit II
No ratings yet
DMDW Unit II
57 pages
3 Processing
No ratings yet
3 Processing
79 pages
Data Preprocessing
No ratings yet
Data Preprocessing
120 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
Unit 1 C
No ratings yet
Unit 1 C
63 pages
Data Preprocessing (Sagar)
No ratings yet
Data Preprocessing (Sagar)
31 pages
Unit I
No ratings yet
Unit I
57 pages
Unit - II
No ratings yet
Unit - II
56 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
Lecture 3 - Data Preprocessing
No ratings yet
Lecture 3 - Data Preprocessing
50 pages
Module 2 (C) - Data Preprocessing
No ratings yet
Module 2 (C) - Data Preprocessing
50 pages
Data Quality and Preprocessing Techniques
No ratings yet
Data Quality and Preprocessing Techniques
63 pages
Data Science Preprocessing Guide
No ratings yet
Data Science Preprocessing Guide
40 pages
Data Preparation Guide COS10022
No ratings yet
Data Preparation Guide COS10022
61 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
14 pages
Data Preprocessing Guide
No ratings yet
Data Preprocessing Guide
45 pages
03 Pre Processing
No ratings yet
03 Pre Processing
89 pages
Datapreparation
No ratings yet
Datapreparation
59 pages
TTDS Lecture 2
No ratings yet
TTDS Lecture 2
40 pages
Module 2 - DM - AI
No ratings yet
Module 2 - DM - AI
61 pages
DP
No ratings yet
DP
44 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
03preprocessing Part1
No ratings yet
03preprocessing Part1
21 pages
Data Preprocessing Essentials
No ratings yet
Data Preprocessing Essentials
41 pages
Lec 3
No ratings yet
Lec 3
31 pages
BIS 541 Ch03 20-21 S
No ratings yet
BIS 541 Ch03 20-21 S
86 pages
03 Preprocessing
No ratings yet
03 Preprocessing
64 pages
03 Preprocessing
No ratings yet
03 Preprocessing
59 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
61 pages
DWDM 3
No ratings yet
DWDM 3
12 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
60 pages
Data Mining Requires Collecting Great Amount of Data (Available in Data Warehouses or Databases) To Achieve The Intended Objective
No ratings yet
Data Mining Requires Collecting Great Amount of Data (Available in Data Warehouses or Databases) To Achieve The Intended Objective
37 pages
Data Preprocessing
No ratings yet
Data Preprocessing
63 pages
Data Mining and Data Warehousing CSPC-308
No ratings yet
Data Mining and Data Warehousing CSPC-308
51 pages
Module1.5 Preprocessing
No ratings yet
Module1.5 Preprocessing
40 pages
03preprocessing 20160222
No ratings yet
03preprocessing 20160222
65 pages
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
No ratings yet
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
64 pages
Chapter 3& 4
No ratings yet
Chapter 3& 4
60 pages
Wk6 Preprocessing
No ratings yet
Wk6 Preprocessing
64 pages
Chapter 3
No ratings yet
Chapter 3
50 pages
Mission Bkash
No ratings yet
Mission Bkash
1 page
Arduino Starter Kit Guide
No ratings yet
Arduino Starter Kit Guide
113 pages
TTL Report
No ratings yet
TTL Report
25 pages
ZigBee Wireless Network Overview
No ratings yet
ZigBee Wireless Network Overview
18 pages
Enhancements To Content Administration - SAP Help Portal
No ratings yet
Enhancements To Content Administration - SAP Help Portal
3 pages
Abb Acs800 Crane Control Manualn697 PDF
100% (1)
Abb Acs800 Crane Control Manualn697 PDF
340 pages
Resume and CV Writing Guide
No ratings yet
Resume and CV Writing Guide
13 pages
Touchless Automation in Touch Screens
No ratings yet
Touchless Automation in Touch Screens
29 pages
How To Model Viral Growth
100% (2)
How To Model Viral Growth
24 pages
Barracuda Waf Vs Fortiweb
No ratings yet
Barracuda Waf Vs Fortiweb
1 page
Curriculum Vitae: Prathamesh P. Mashilkar
No ratings yet
Curriculum Vitae: Prathamesh P. Mashilkar
3 pages
Advanced Encryption Standard Guide
No ratings yet
Advanced Encryption Standard Guide
11 pages
CultFit FAQs - 02822e35cca - 1717213838275
No ratings yet
CultFit FAQs - 02822e35cca - 1717213838275
2 pages
Freenas9.1.0 Guide
No ratings yet
Freenas9.1.0 Guide
272 pages
Mobile Payment Systems Overview
No ratings yet
Mobile Payment Systems Overview
26 pages
Honeywell DCS EPKS Training Module - Part3
No ratings yet
Honeywell DCS EPKS Training Module - Part3
500 pages
Steve's NVT-HV Indicator
No ratings yet
Steve's NVT-HV Indicator
8 pages
Salesforce CRMconsulting
100% (1)
Salesforce CRMconsulting
25 pages
Bece355l Aws-For-Cloud-Computing TH 1.0 80 Bece355l
No ratings yet
Bece355l Aws-For-Cloud-Computing TH 1.0 80 Bece355l
2 pages
C++ Programming Exercises
No ratings yet
C++ Programming Exercises
3 pages
Deswik - Suite 2021.1 Release Notes
100% (4)
Deswik - Suite 2021.1 Release Notes
224 pages
Zhejiang University 2003-2004 Computer Science Exam
No ratings yet
Zhejiang University 2003-2004 Computer Science Exam
8 pages
Distributed Systems Unit-1 Notes
No ratings yet
Distributed Systems Unit-1 Notes
18 pages
X1782 MLB - 820-01987 - 820-01987-06
No ratings yet
X1782 MLB - 820-01987 - 820-01987-06
100 pages
2025-04-13
No ratings yet
2025-04-13
2 pages
03 - TRAINING PROGRAM - SCS Operation & Troubleshooting For S8
100% (1)
03 - TRAINING PROGRAM - SCS Operation & Troubleshooting For S8
54 pages
Semi Custom Design Flow Leveraging Place
No ratings yet
Semi Custom Design Flow Leveraging Place
20 pages
Vehicle Speed Detection Using Computer Vision Technique
No ratings yet
Vehicle Speed Detection Using Computer Vision Technique
16 pages
Who Wants To Be A Millionaire
No ratings yet
Who Wants To Be A Millionaire
49 pages
PIDKey Lite by Ratiborus - EN
No ratings yet
PIDKey Lite by Ratiborus - EN
25 pages

Correlation

Uploaded by

Correlation

Uploaded by

Data Mining:

Concepts and Techniques

 Measures for data quality: A multidimensional view

 Data is not always available

 data entry problems

 data transmission problems

 inconsistency in naming convention

 Other data problems which require data cleaning

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.

 Combined computer and human inspection

deal with possible outliers)

 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools

 Data scrubbing: use simple domain knowledge (e.g., postal

code, spell-check) to detect errors and make corrections

relationship to detect violators (e.g., correlation and clustering

 ETL (Extraction/Transformation/Loading) tools: allow users to

specify transformations through a graphical user interface

 Redundant data occur often when integration of multiple

Play chess Not play chess Sum (row)

Not like science fiction 50(210) 1000(840) 1050

Sum(col.) 300 1200 1500

 Χ2 (chi-square) calculation (numbers in parenthesis are

You might also like