Unit 3 covers data pre-processing techniques essential for data analysis, including handling missing data, data cleaning, integration, and transformation. It discusses various types of data attributes, their significance, and methods for ensuring data quality. Key tasks include data reduction, feature selection, and the application of techniques like PCA and normalization to enhance data usability.


Unit 3: Data pre-processing

(Chapter 2 and Chapter 3 of Han-Kamber, 3rd edition)

• Need for data pre-processing
• Attributes and data types
• Statistical descriptions of data
• Handling missing data
• Data sampling
• Data cleaning
• Data integration and transformation
• Data reduction – curse of dimensionality
• Feature selection and feature engineering
• Principal Component Analysis (PCA)
• Discretization and generating concept hierarchies
Data objects and attributes
❖Data sets are made up of data objects
❖Data object – entity, sample, example, instance, data point, tuple, row
❖Attribute – data field, dimension, feature, variable
❖Observation – observed value of an attribute
❖Attribute vector – Feature vector – a set of attributes used to describe
a given object

10/15/2022 Compiled by PROF. SURABHI THATTE


Types of attribute
❖Nominal = Categorical
❖Relating to names
❖Values are symbols or names of things
❖Each value represents a category, code, state
❖No meaningful order
❖Example:
1. Hair_color: black, brown, blond
2. Marital_status: single, married, divorced
3. Occupation: teacher, doctor, farmer
❖Can also be represented by numbers (e.g. 1 = black, 2 = brown)
❖No mathematical operations, no meaningful order, not quantitative
❖Possible to find mode – most commonly occurring value



Types of attribute
❖Binary Attributes
❖Nominal attribute with only 2 categories: 0 or 1
❖True/False, Present/Absent, Positive/Negative, Yes/No
❖Examples:
❖Diabetic: yes/no
❖Cancer: yes/no
❖Anomalous: true/false
❖Symmetric – If both states are equally valuable and carry
same weight
❖Asymmetric – If outcomes have different importance
❖Most important or rarest outcome is coded as 1
❖Example: Dengue positive: 1 , Dengue negative: 0



Types of attribute
❖Ordinal Attributes
❖The values have a meaningful order or ranking among them
❖Magnitude between successive values is not known
❖Example:
❖Customer_satisfaction: very satisfied, somewhat satisfied, neutral,
dissatisfied
❖Size_of_beverage: small, medium, large
❖Professional_rank: assistant professor, associate professor,
professor
❖Useful for registering subjective assessment of qualities
❖Mean cannot be defined, but median and mode can be defined
❖Qualitative attribute – actual quantity not given



Numeric Attributes
❖Interval-Scaled Attributes
❖Measured on the scale of equal-size units
❖Values have order and can be positive or negative
❖Difference between values can be compared and quantified
❖We cannot speak of values in terms of ratio
❖Mean, median, mode can be calculated
❖Example: Temperature, Date
❖Ratio-scaled Attributes
❖Numeric attribute with an inherent zero-point
❖Difference and ratio can be calculated
❖Mean, median, mode can be calculated
❖Example: years_of_experience, number_of_words, weight, height



Discrete versus Continuous
❑Discrete attribute – finite or countably infinite set of values
❑Examples: number_of_students, drink_size, customer_id, zipcode
❑Continuous attribute – real numbers, floating-point variables
❑Example: height



Data Quality: Why
Preprocess the Data?
Measures for data quality: A multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: timely update?
◦ Believability: how much are the data trusted to be correct?
◦ Interpretability: how easily the data can be understood?

◦ Refer to Han-Kamber for more details



Major Tasks in Data
Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
◦ Resolving inconsistencies (customer_id vs cust_id)
Data reduction- reduced volume but same analysis result
◦ Dimensionality reduction – wavelet transform, PCA
◦ Numerosity reduction – log linear models, clusters
◦ Data compression
Data transformation and data discretization
◦ Normalization , discretization
◦ Concept hierarchy generation
*Above categorization is not mutually exclusive. Removal of redundant data is data
cleaning as well as data reduction



Forms of Data Preprocessing



Data Cleaning



Data Cleaning
Data in the Real World is Dirty
Reason: instrument faulty, human or computer error, transmission error
◦ Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◦ e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
◦ e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names, e.g.,
◦ Age = “42”, Birthday = “03/07/2010”
◦ Was rating “1, 2, 3”, now rating “A, B, C”
◦ Discrepancy between duplicate records
◦ Intentional (e.g., disguised missing data)
◦ Jan. 1 as everyone’s birthday?



Incomplete (Missing) Data
▪ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
▪ Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding / privacy issues
▪ certain data may not be considered important at the time of entry
▪ Missing data may need to be inferred
▪ Does missing value always imply error in the data? Justify.



How to Handle Missing Data?
❑ Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values per
attribute varies considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it automatically with
◦ a global constant : e.g., “unknown”, a new class?!
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class:
smarter
◦ the most probable value: inference-based such as Bayesian formula or
decision tree …by considering other attributes
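The fill strategies above can be sketched in plain Python; the records and class labels below are invented for illustration, with None marking a missing income:

```python
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value
rows = [("A", 50.0), ("A", None), ("B", 30.0), ("B", None), ("A", 70.0)]

observed = [v for _, v in rows if v is not None]
overall_mean = mean(observed)

# Per-class means, for the "smarter" class-conditional fill
by_class = {}
for c, v in rows:
    if v is not None:
        by_class.setdefault(c, []).append(v)
class_mean = {c: mean(vs) for c, vs in by_class.items()}

# Fill with the overall attribute mean vs. the same-class mean
mean_filled = [v if v is not None else overall_mean for _, v in rows]
class_filled = [v if v is not None else class_mean[c] for c, v in rows]
```

Note that the class-conditional fill gives the two missing incomes different values (the means of their own classes), while the global mean fills both with the same number.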



Noisy Data
❑ What is noise?
❑ Random error, variance in a measured variable
❑ How do we identify noise?
❑ Boxplots, Scatter plots, other methods of data visualization
❑ Data Smoothing Techniques
❑ Binning
❑ Regression
❑ Outlier Analysis



Binning
➢ Binning methods smooth a sorted data value by consulting its
neighborhood (local smoothing)
➢ Sorted values are distributed into a number of equal-frequency
buckets (bins)
➢ Smoothing by bin means – each value of bin is replaced by mean
value of bin
➢ Smoothing by bin medians – each bin value is replaced by bin
median
➢ Smoothing by bin boundaries – minimum and maximum values in a
given bin are identified as bin boundaries. Each bin value is replaced
by closest boundary value.
Question: does a smaller or a larger bin width give a greater smoothing effect? (In general, the larger the width, the greater the effect of smoothing.)



Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
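The worked example above can be reproduced directly; this sketch assumes the same sorted price list and equal-frequency bins of depth 4:

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Partition into equal-frequency (equi-depth) bins of 4 values each
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin becomes the (rounded) bin mean
by_means = [[round(mean(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of the
# bin minimum b[0] and maximum b[-1]; ties go to the lower boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```

Both results match the smoothed bins shown above.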
Regression
➢ Technique that smooths data by fitting the values to a function
➢ Linear regression involves finding the best line to fit two attributes
or variables so that one can be used to predict the other
➢ Example: using years of experience to predict salary
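As a sketch of regression-based smoothing, the least-squares line can be fitted with the closed-form formulas; the experience/salary pairs below are invented:

```python
# Fit y = a*x + b by ordinary least squares;
# x = years of experience (hypothetical), y = salary in $1000s
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 41, 44, 50]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Slope: covariance of x,y divided by variance of x; intercept from the means
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Smoothed data: each observed value replaced by its point on the fitted line
smoothed = [a * x + b for x in xs]
```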



Outlier analysis
❑ Outlier can be detected by clustering
❑ Outlier detection or anomaly detection is the process of finding data
objects with behaviors that are very different from expectations
❑ Applications:
❑ Fraud detection, security, image processing, video analysis ,
intrusion detection
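Besides clustering, a simple statistical rule is often used as a first pass: flag values far from the mean. This sketch (with invented measurements) flags anything more than two standard deviations away:

```python
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 95, 11, 10]  # hypothetical sensor readings

mu, sigma = mean(values), stdev(values)

# Flag points more than 2 sample standard deviations from the mean
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
```

Here only the anomalous reading 95 is flagged; the threshold of 2 is a common convention, not a fixed rule.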



Discussion
Is concept hierarchy a form of data discretization?
Can it be used for data smoothing?



Tools for discrepancy detection
❑ Data scrubbing tools use simple domain knowledge (e.g. knowledge
of postal addresses and spell-checking) to detect errors and make
corrections in the data
❑ Data auditing tools analyze data to discover rules and relationships
and detect data that violate such conditions
❑ Potter’s Wheel is a publicly available data cleaning tool that does
discrepancy detection and transformation



Data Integration



Data Integration
❑ Merging of data from multiple data stores
❑ Problems – redundancies and inconsistencies
❑ Challenges – matching schema and objects from different sources



Entity Identification Problem
❑ Problem of matching equivalent real world entities from multiple
data sources
❑ How can a data analyst be sure that customer_id from one database
and cust_number in another database refer to the same attribute?
❑ Metadata can help to avoid data integration issues
❑ Metadata for each attribute include name, meaning, data type,
range of values permitted, null rules
❑ Functional dependencies and referential constraints should be
taken care of during data integration



Handling Redundancy in Data Integration
❖ Redundant data occur often during integration of multiple databases
❖ Object identification: The same attribute or object may have
different names in different databases
❖ Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue

❖ Redundant attributes can often be detected by correlation analysis:
the chi-square (χ²) correlation test for nominal data, and the
correlation coefficient or covariance analysis for numeric data
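The numeric case can be sketched with Pearson's correlation coefficient; the two attributes below are invented and perfectly linearly related, so r comes out as 1 and one attribute is redundant given the other:

```python
from statistics import mean, stdev

# Two hypothetical numeric attributes; b = 2*a + 1, perfectly correlated
a = [2, 4, 6, 8, 10]
b = [5, 9, 13, 17, 21]

ma, mb = mean(a), mean(b)

# Pearson correlation coefficient: values near +1 or -1 suggest redundancy
r = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (
    (len(a) - 1) * stdev(a) * stdev(b))
```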



Other Problems in Integration
❑ Tuple Duplication - redundancy at tuple level
❑ Denormalization is one cause of redundancy
❑ Data value conflict detection – ‘weight’ attribute may be stored in
different measurement systems in different databases
❑ Currencies and tax calculation rules are different for different
countries



Data Reduction



Strategies
❖ Discrete Wavelet Transforms(DWT)
❖ Principal Components Analysis (PCA)
❖ Attribute subset selection
❖ Clustering
❖ Sampling
❖ Data Cube Aggregation
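As a sketch of PCA-based dimensionality reduction (assuming NumPy is available; the strongly correlated 2-D data are synthetic), the classic eigen-decomposition of the covariance matrix reduces two attributes to one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: second attribute is almost a multiple of the first
x = rng.normal(size=200)
X = np.column_stack([x, 3 * x + rng.normal(scale=0.1, size=200)])

# PCA: center the data, then eigen-decompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]              # principal components, strongest first

# Project onto the first component: 2-D reduced to 1-D
reduced = Xc @ components[:, :1]
explained = eigvals[order] / eigvals.sum()  # fraction of variance per component
```

Because the two attributes are nearly collinear, the first component captures almost all of the variance, which is exactly when dropping the remaining components loses little information.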



Attribute Subset Selection
▪ In multi-dimensional data, some attributes may be irrelevant to the
data mining task
▪ Example – If the task is to classify customers based on whether or
not they are likely to purchase a popular new CD at the store
▪ Relevant attributes – age, music_taste
▪ Irrelevant attributes – telephone number
▪ Domain expert can pick out relevant attributes , but time-
consuming
▪ Attribute subset selection (Feature subset selection in ML) reduces
data set size by removing irrelevant attributes



Finding good subset
▪ For ‘n’ attributes, there are 2ⁿ possible subsets
▪ Heuristic (greedy) methods are used for attribute subset selection
▪ These methods make locally optimal choice, hoping that it will lead
to global optimal solution
▪ Best attributes are decided by measures such as ‘information gain’



Sampling
❖ Data reduction technique
❖ Allows a large dataset to be represented by a much smaller data
sample
❖ Allows a mining algorithm to run with complexity that is potentially
sub-linear in the size of the data
❖ Key principle: Choose a representative subset of the data



Types of Sampling
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Cluster sample
• If tuples in D are grouped into M disjoint clusters, then a sample from
each cluster can be obtained
• Stratified sampling
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
• Example – creating a stratum for each age group
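The sampling variants above can be sketched with the standard library; the population of 100 tuples and the age-group strata here are invented:

```python
import random

random.seed(42)
data = list(range(100))  # hypothetical population of 100 tuples

# Simple random sample WITHOUT replacement: no tuple can repeat
srswor = random.sample(data, 10)

# Simple random sample WITH replacement: a tuple may be drawn again
srswr = [random.choice(data) for _ in range(10)]

# Stratified sample: draw proportionally from each stratum
strata = {"young": list(range(40)),
          "mid": list(range(40, 90)),
          "senior": list(range(90, 100))}
rate = 0.1
stratified = [v for group in strata.values()
              for v in random.sample(group, max(1, round(len(group) * rate)))]
```

Proportional draws (4 + 5 + 1 here) guarantee the small "senior" stratum is represented, which is the point of stratification on skewed data.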



Sampling: with or without Replacement




Cluster or Stratified Sampling



Data Transformation



Data Transformation
1. Data are transformed and consolidated so that the resulting mining
process is efficient.
2. Strategies for data transformation
1. Smoothing
2. Attribute construction – attribute discovery
3. Aggregation
4. Normalization
5. Discretization
6. Concept hierarchy generation



Normalization
➢ Normalizing the data attempts to give all attributes an equal weight
➢ For distance based methods, normalization helps prevent attributes with
initially large ranges (e.g. income) from outweighing attributes with smaller
ranges (e.g. age)
➢ It removes dependence on measurement units
➢ Normalization involves transforming the data to fall within a smaller or
common range such as [-1,1] or [0.0,1.0]
➢ Normalization is useful for algorithms like Neural Networks, or distance
based algorithms like Nearest Neighbour classification as well as Clustering
➢ Methods:
➢ Min-max normalization
➢ Z-score normalization
➢ Decimal Scaling



Min- Max Normalization
• Let A be a numeric attribute (e.g. income) with n observed values

• Let minA and maxA be the minimum and maximum values of A

• Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA]:

      v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:

      ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
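Min-max normalization (and, for comparison, z-score normalization) can be sketched on the book's income figures; the intermediate value $47,000 is added here for illustration:

```python
from statistics import mean, stdev

incomes = [12_000, 47_000, 73_600, 98_000]

# Min-max normalization to the range [new_min, new_max]
new_min, new_max = 0.0, 1.0
lo, hi = min(incomes), max(incomes)
minmax = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
          for v in incomes]

# Z-score normalization: (v - mean) / standard deviation
mu, sigma = mean(incomes), stdev(incomes)
zscores = [(v - mu) / sigma for v in incomes]
```

As in the worked example, $73,600 maps to about 0.716; the z-scored attribute has mean 0, which is what removes the dependence on the original units.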
Discretization
Discretization: Divide the range of a continuous attribute into
intervals
◦ Interval labels can then be used to replace actual data values
◦ Reduce data size by discretization
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepare for further analysis, e.g., classification
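A minimal sketch of unsupervised equal-width discretization, replacing each continuous value with an interval label; the ages are invented and the number of intervals is an arbitrary choice:

```python
# Equal-width discretization of a continuous attribute into k intervals
ages = [3, 7, 12, 25, 30, 44, 61, 70, 88]

k = 3
lo, hi = min(ages), max(ages)
width = (hi - lo) / k

def label(v: float) -> str:
    """Return the interval label that replaces the raw value v."""
    i = min(int((v - lo) / width), k - 1)   # clamp the max value into the last bin
    left, right = lo + i * width, lo + (i + 1) * width
    return f"[{left:.1f}, {right:.1f})"

labels = [label(v) for v in ages]
```

Equal-frequency intervals (same count per bin, as in binning) are the usual alternative when the value distribution is skewed.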



Extra



Data Wrangling
❖ Data Wrangling is the process of converting and mapping data from
its raw form to another format with the purpose of making it more
valuable and appropriate for advanced tasks such as Data Analytics
and Machine Learning.
❖ Difference between Data Wrangling and ETL
❖ Users – Business Analysts vs IT employees
❖ Data – diverse, complex vs well structured
❖ Use Cases – Exploratory vs Reporting & Analysis
❖ Yet, Data Wrangling and ETL are complementary

