Unit 3 covers data pre-processing techniques essential for data analysis, including handling missing data, data cleaning, integration, and transformation. It discusses various types of data attributes, their significance, and methods for ensuring data quality. Key tasks include data reduction, feature selection, and the application of techniques like PCA and normalization to enhance data usability.
Unit 3: Data Pre-processing
Reference: Chapters 2 and 3 of Han-Kamber, 3rd edition
Compiled by Prof. Surabhi Thatte, 10/15/2022

Topics: Need for data pre-processing; attributes and data types; statistical descriptions of data; handling missing data; data sampling; data cleaning; data integration and transformation; data reduction and the curse of dimensionality; feature selection and feature engineering; Principal Component Analysis (PCA); discretization and generating concept hierarchies.

Data Objects and Attributes
❖ Data sets are made up of data objects
❖ Data object – entity, sample, example, instance, data point, tuple, row
❖ Attribute – data field, dimension, feature, variable
❖ Observation – observed value of an attribute
❖ Attribute vector (feature vector) – a set of attributes used to describe a given object
Types of Attribute
❖ Nominal = Categorical
❖ Relating to names: values are symbols or names of things
❖ Each value represents a category, code, or state
❖ No meaningful order
❖ Examples:
  1. Hair_color: black, brown, blond
  2. Marital_status: single, married, divorced
  3. Occupation: teacher, doctor, farmer
❖ Can also be represented by numbers (e.g., 1 = red, 2 = black)
❖ No mathematical operations, no meaningful order, not quantitative
❖ Possible to find the mode – the most commonly occurring value (see the sketch below)
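As a minimal sketch (the hair_color values here are made up), the mode is the one measure of central tendency that is meaningful for a nominal attribute:

```python
import pandas as pd

# Hypothetical nominal attribute.
hair_color = pd.Series(["black", "brown", "black", "blond", "black"])

# The mode is the only measure of central tendency that makes sense
# for nominal data (no order, no arithmetic on the values).
print(hair_color.mode()[0])   # -> "black"
```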
Types of Attribute
❖ Binary Attributes
❖ Nominal attribute with only 2 categories: 0 or 1
❖ True/False, Present/Absent, Positive/Negative, Yes/No
❖ Examples: Diabetic: yes/no; Cancer: yes/no; Anomalous: true/false
❖ Symmetric – both states are equally valuable and carry the same weight
❖ Asymmetric – the outcomes have different importance
  ❖ The most important or rarest outcome is coded as 1
  ❖ Example: Dengue positive: 1, Dengue negative: 0
Types of Attribute
❖ Ordinal Attributes
❖ The values have a meaningful order or ranking among them
❖ The magnitude between successive values is not known
❖ Examples:
  ❖ Customer_satisfaction: very satisfied, somewhat satisfied, neutral, dissatisfied
  ❖ Size_of_beverage: small, medium, large
  ❖ Professional_rank: assistant professor, associate professor, professor
❖ Useful for registering subjective assessments of qualities
❖ The mean cannot be defined, but the median and mode can be
❖ Qualitative attribute – the actual quantity is not given
Numeric Attributes
❖ Interval-Scaled Attributes
  ❖ Measured on a scale of equal-size units
  ❖ Values have order and can be positive or negative
  ❖ Differences between values can be compared and quantified
  ❖ We cannot speak of values in terms of ratios (no inherent zero-point)
  ❖ Mean, median, and mode can be calculated
  ❖ Examples: temperature, dates
❖ Ratio-Scaled Attributes
  ❖ Numeric attribute with an inherent zero-point
  ❖ Both differences and ratios can be calculated
  ❖ Mean, median, and mode can be calculated
  ❖ Examples: years_of_experience, number_of_words, weight, height
Discrete versus Continuous
❑ Discrete attribute – finite or countably infinite set of values
  ❑ Examples: number_of_students, drink_size, customer_id, zipcode
❑ Continuous attribute – real numbers, floating-point variables
  ❑ Example: height
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: is the data updated in a timely manner?
◦ Believability: how trustworthy is the data?
◦ Interpretability: how easily can the data be understood?
◦ Refer to Han-Kamber for more details
Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
◦ Resolving inconsistencies (customer_id vs cust_id)
Data reduction – reduced volume, but the same analysis result
◦ Dimensionality reduction – wavelet transforms, PCA (see the sketch after this list)
◦ Numerosity reduction – log-linear models, clusters
◦ Data compression
Data transformation and data discretization
◦ Normalization, discretization
◦ Concept hierarchy generation
*The above categorization is not mutually exclusive: removal of redundant data is data cleaning as well as data reduction.
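The syllabus names PCA as a dimensionality-reduction technique but the slides do not detail it; as a hedged sketch (scikit-learn on made-up data, with an arbitrary choice of 10 attributes and 3 components):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 objects described by 10 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project the data onto its 3 directions of highest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```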
Forms of Data Preprocessing
Data Cleaning
Data Cleaning
Data in the real world is dirty. Reasons: faulty instruments, human or computer error, transmission errors.
◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  ◦ e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
  ◦ e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names
  ◦ e.g., Age = “42” but Birthday = “03/07/2010”
  ◦ Rating was “1, 2, 3”, now rating is “A, B, C”
  ◦ Discrepancies between duplicate records
◦ Intentional (e.g., disguised missing data)
  ◦ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
▪ Data is not always available
  ▪ e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
▪ Missing data may be due to
  ▪ equipment malfunction
  ▪ values inconsistent with other recorded data, and thus deleted
  ▪ data not entered due to misunderstanding or privacy issues
  ▪ certain data not being considered important at the time of entry
▪ Missing data may need to be inferred
▪ Does a missing value always imply an error in the data? Justify.
How to Handle Missing Data?
❑ Ignore the tuple: usually done when the class label is missing (when doing classification) – not effective when the % of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious, and often infeasible
❑ Fill it in automatically with
  ◦ a global constant: e.g., “unknown” – effectively a new class?!
  ◦ the attribute mean or median
  ◦ the attribute mean for all samples belonging to the same class: smarter
  ◦ the most probable value: inference-based, such as a Bayesian formula or a decision tree that considers the other attributes
A sketch of two of the automatic strategies follows below.
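A minimal pandas sketch of the "attribute mean" and "per-class mean" fill strategies, on a hypothetical sales table with missing income values:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with missing customer income.
df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global fill: the attribute mean over all tuples.
df["income_global"] = df["income"].fillna(df["income"].mean())

# Smarter fill: the attribute mean over samples of the same class.
class_means = df.groupby("class")["income"].transform("mean")
df["income_by_class"] = df["income"].fillna(class_means)
print(df)
```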
Noisy Data
❑ What is noise? Random error or variance in a measured variable
❑ How do we identify noise? Boxplots, scatter plots, and other data-visualization methods
❑ Data smoothing techniques:
  ❑ Binning
  ❑ Regression
  ❑ Outlier analysis
Binning
➢ Binning methods smooth a sorted data value by consulting its neighborhood (local smoothing)
➢ Sorted values are distributed into a number of equal-frequency buckets (bins)
➢ Smoothing by bin means – each value in a bin is replaced by the bin's mean
➢ Smoothing by bin medians – each value in a bin is replaced by the bin's median
➢ Smoothing by bin boundaries – the minimum and maximum values in a given bin are identified as the bin boundaries; each value in the bin is replaced by the closer boundary value
➢ Note: the larger the bin width, the greater the effect of the smoothing
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
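A small numpy sketch reproducing this worked example (the depth of 4 is taken from the example above; the boundary tie-breaking rule toward the maximum is an assumption):

```python
import numpy as np

# Price data from the worked example above.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into equal-frequency bins of depth 4.
bins = prices.reshape(-1, 4)

# Smoothing by bin means: replace every value by its bin's (rounded) mean.
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4)
print(by_means)            # [ 9  9  9  9 23 23 23 23 29 29 29 29]

# Smoothing by bin boundaries: snap each value to the closer of the
# bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(by_bounds.ravel())   # [ 4  4  4 15 21 21 25 25 26 26 26 34]
```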
Regression
➢ A technique that fits data values to a function
➢ Linear regression involves finding the best line to fit two attributes (variables), so that one can be used to predict the other
➢ Example: using years of experience to predict salary
Outlier Analysis
❑ Outliers can be detected by clustering
❑ Outlier detection (anomaly detection) is the process of finding data objects whose behavior is very different from expectations
❑ Applications: fraud detection, security, image processing, video analysis, intrusion detection
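The slide names clustering as a detector; as a simpler boxplot-style alternative (not the slide's method), here is a hedged sketch of the 1.5 × IQR rule on made-up values:

```python
import numpy as np

# Made-up measurements with one suspicious value.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 204])

# Boxplot-style rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # [204]
```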
Discussion: Is a concept hierarchy a form of data discretization? Can it be used for data smoothing?
Tools for Discrepancy Detection
❑ Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data
❑ Data auditing tools analyze the data to discover rules and relationships, and to detect data that violate such conditions
❑ Potter’s Wheel is a publicly available data-cleaning tool that performs discrepancy detection and transformation
Data Integration
Data Integration
❑ Merging of data from multiple data stores
❑ Problems – redundancies and inconsistencies
❑ Challenges – matching schemas and objects from different sources
Entity Identification Problem
❑ The problem of matching equivalent real-world entities from multiple data sources
❑ How can a data analyst be sure that customer_id in one database and cust_number in another database refer to the same attribute?
❑ Metadata can help to avoid data integration issues
  ❑ Metadata for each attribute includes its name, meaning, data type, permitted range of values, and null rules
❑ Functional dependencies and referential constraints should be taken care of during data integration
Handling Redundancy in Data Integration
❖ Redundant data occur often during the integration of multiple databases
  ❖ Object identification: the same attribute or object may have different names in different databases
  ❖ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
❖ Redundant attributes may be detected by the chi-square (χ²) correlation test for nominal data, or by the correlation coefficient or covariance analysis for numeric data, as sketched below
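A hedged sketch of all three checks; the attribute names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical nominal attributes; the chi-square test on their
# contingency table checks whether they are correlated (redundant).
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "preferred_reading": ["fiction", "fiction", "non_fiction", "fiction",
                          "non_fiction", "fiction", "non_fiction", "fiction"],
})
table = pd.crosstab(df["gender"], df["preferred_reading"])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # a small p-value suggests the attributes are correlated

# Hypothetical numeric attributes: correlation coefficient and covariance.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 4.2, 5.9, 8.1, 9.8])
print(np.corrcoef(a, b)[0, 1])   # close to +1 -> likely redundant
print(np.cov(a, b)[0, 1])        # covariance of the two attributes
```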
Other Problems in Integration
❑ Tuple duplication – redundancy at the tuple level
  ❑ Denormalization is one cause of such redundancy
❑ Data value conflict detection – a ‘weight’ attribute may be stored in different measurement systems in different databases
  ❑ Currencies and tax calculation rules differ between countries
Attribute Subset Selection
▪ In multi-dimensional data, some attributes may be irrelevant to the data mining task
▪ Example – if the task is to classify customers based on whether or not they are likely to purchase a popular new CD at the store:
  ▪ Relevant attributes – age, music_taste
  ▪ Irrelevant attributes – telephone number
▪ A domain expert can pick out the relevant attributes, but this is time-consuming
▪ Attribute subset selection (feature subset selection in ML) reduces the data set size by removing irrelevant attributes
Finding a Good Subset
▪ For n attributes, there are 2^n possible subsets
▪ Heuristic (greedy) methods are therefore used for attribute subset selection
▪ These methods make the locally optimal choice at each step, hoping that this will lead to a globally optimal solution
▪ The best attributes are decided by measures such as information gain (see the sketch below)
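A hedged sketch of greedy forward selection using scikit-learn on synthetic data; the estimator, dataset, and subset size are all illustrative assumptions, not the course's prescribed method. The decision tree is configured with criterion="entropy" so its splits use the information-gain measure the slide mentions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 attributes, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Greedy forward selection: at each step, add the attribute that most
# improves the model (a locally optimal choice), hoping to approach
# the globally optimal subset.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
selector = SequentialFeatureSelector(tree, n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask over the 10 attributes
```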
Sampling
❖ A data reduction technique
❖ Allows a large dataset to be represented by a much smaller data sample
❖ Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
❖ Key principle: choose a representative subset of the data
Types of Sampling
• Sampling without replacement
  • Once an object is selected, it is removed from the population
• Sampling with replacement
  • A selected object is not removed from the population, so it may be drawn again
• Cluster sampling
  • If the tuples in D are grouped into M disjoint clusters, a sample can be obtained from each cluster
• Stratified sampling
  • Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  • Used in conjunction with skewed data
  • Example – creating a stratum for each age group (see the sketch below)
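A minimal pandas sketch of three of these schemes; the age_group strata and sample sizes are made-up assumptions:

```python
import pandas as pd

# Hypothetical data set with a skewed age distribution.
df = pd.DataFrame({
    "age_group": ["young"] * 6 + ["middle"] * 3 + ["senior"] * 1,
    "income": range(10),
})

# Simple random sampling without replacement:
# once drawn, a tuple cannot be drawn again.
print(df.sample(n=4, replace=False, random_state=0))

# Simple random sampling with replacement:
# the same tuple may appear more than once.
print(df.sample(n=4, replace=True, random_state=0))

# Stratified sampling: draw roughly the same fraction from every
# age_group stratum (very small strata may contribute no rows).
print(df.groupby("age_group").sample(frac=0.5, random_state=0))
```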
Sampling: With or Without Replacement
[Figure: samples drawn from the raw data, with and without replacement]
Cluster or Stratified Sampling
[Figure: a cluster/stratified sample drawn from the raw data]
Data Transformation
Data Transformation
Data are transformed and consolidated so that the resulting mining process is more efficient.
Strategies for data transformation:
1. Smoothing
2. Attribute construction – attribute discovery
3. Aggregation
4. Normalization
5. Discretization
6. Concept hierarchy generation
Normalization
➢ Normalizing the data attempts to give all attributes an equal weight
➢ For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with smaller ranges (e.g., age)
➢ It removes dependence on measurement units
➢ Normalization transforms the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0]
➢ Normalization is useful for algorithms like neural networks, and for distance-based algorithms like nearest-neighbour classification and clustering
➢ Methods:
  ➢ Min-max normalization
  ➢ Z-score normalization
  ➢ Decimal scaling
Min-Max Normalization
• Let A be a numeric attribute (e.g., income) with n observed values
• Let min_A and max_A be the minimum and maximum values of A
• Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A]:

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Example: let an income range of $12,000 to $98,000 be normalized to [0.0, 1.0]
• Then $73,600 is mapped to

  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
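A numpy sketch of all three normalization methods from the list above; the income values are made up apart from the slide's endpoints and $73,600:

```python
import numpy as np

income = np.array([12_000.0, 47_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [new_min, new_max] = [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_min, v_max = income.min(), income.max()
minmax = (income - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(minmax)           # 73,600 maps to ~0.716, as in the worked example

# Z-score normalization: zero mean, unit standard deviation.
zscore = (income - income.mean()) / income.std()
print(zscore)

# Decimal scaling: divide by 10^j, the smallest power of 10
# that makes max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)   # 98,000 -> 0.98 with j = 5
```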
Discretization
Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace the actual data values
◦ Reduces the data size
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepares the data for further analysis, e.g., classification
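A short pandas sketch of two unsupervised discretization schemes; the age values and interval labels are made up:

```python
import pandas as pd

# Hypothetical continuous attribute: ages.
ages = pd.Series([13, 22, 25, 31, 40, 47, 55, 62, 70])

# Equal-width discretization: 3 intervals of equal range, with
# interval labels replacing the actual values.
print(pd.cut(ages, bins=3, labels=["young", "middle", "senior"]).tolist())

# Equal-frequency discretization: 3 intervals holding roughly the
# same number of values each.
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]).tolist())
```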
Extra
Data Wrangling
❖ Data wrangling is the process of converting and mapping data from its raw form to another format, with the purpose of making it more valuable and appropriate for advanced tasks such as data analytics and machine learning.
❖ Differences between data wrangling and ETL:
  ❖ Users – business analysts vs. IT employees
  ❖ Data – diverse and complex vs. well structured
  ❖ Use cases – exploratory vs. reporting & analysis
❖ Yet data wrangling and ETL are complementary