Unit 3 covers data pre-processing techniques essential for data analysis, including handling missing data, data cleaning, integration, and transformation. It discusses various types of data attributes, their significance, and methods for ensuring data quality. Key tasks include data reduction, feature selection, and the application of techniques like PCA and normalization to enhance data usability.
Unit 3: Data Pre-processing
Reference: Chapters 2 and 3 of Han-Kamber, 3rd edition
Compiled by Prof. Surabhi Thatte, 10/15/2022

Topics: Need for data pre-processing; attributes and data types; statistical descriptions of data; handling missing data; data sampling; data cleaning; data integration and transformation; data reduction and the curse of dimensionality; feature selection and feature engineering; Principal Component Analysis (PCA); discretization and generating concept hierarchies.

Data Objects and Attributes
❖ Data sets are made up of data objects
❖ Data object – entity, sample, example, instance, data point, tuple, row
❖ Attribute – data field, dimension, feature, variable
❖ Observation – observed value of an attribute
❖ Attribute vector (feature vector) – a set of attributes used to describe a given object
Types of Attribute
❖ Nominal = Categorical
❖ Relating to names: values are symbols or names of things
❖ Each value represents a category, code, or state
❖ No meaningful order
❖ Examples:
  1. Hair_color: black, brown, blond
  2. Marital_status: single, married, divorced
  3. Occupation: teacher, doctor, farmer
❖ Can also be represented by numbers (e.g., 1 = red, 2 = black)
❖ No mathematical operations, no meaningful order, not quantitative
❖ Possible to find the mode – the most commonly occurring value (see the sketch below)
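As a minimal sketch (the hair_color values here are made up), the mode is the one measure of central tendency that is meaningful for a nominal attribute:

```python
import pandas as pd

# Hypothetical nominal attribute.
hair_color = pd.Series(["black", "brown", "black", "blond", "black"])

# The mode is the only measure of central tendency that makes sense
# for nominal data (no order, no arithmetic on the values).
print(hair_color.mode()[0])   # -> "black"
```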
Types of Attribute
❖ Binary Attributes
❖ Nominal attribute with only 2 categories: 0 or 1
❖ True/False, Present/Absent, Positive/Negative, Yes/No
❖ Examples: Diabetic: yes/no; Cancer: yes/no; Anomalous: true/false
❖ Symmetric – both states are equally valuable and carry the same weight
❖ Asymmetric – the outcomes have different importance
  ❖ The most important or rarest outcome is coded as 1
  ❖ Example: Dengue positive: 1, Dengue negative: 0
Types of Attribute
❖ Ordinal Attributes
❖ The values have a meaningful order or ranking among them
❖ The magnitude between successive values is not known
❖ Examples:
  ❖ Customer_satisfaction: very satisfied, somewhat satisfied, neutral, dissatisfied
  ❖ Size_of_beverage: small, medium, large
  ❖ Professional_rank: assistant professor, associate professor, professor
❖ Useful for registering subjective assessments of qualities
❖ The mean cannot be defined, but the median and mode can be
❖ Qualitative attribute – the actual quantity is not given
Numeric Attributes
❖ Interval-Scaled Attributes
  ❖ Measured on a scale of equal-size units
  ❖ Values have order and can be positive or negative
  ❖ Differences between values can be compared and quantified
  ❖ We cannot speak of values in terms of ratios (no inherent zero-point)
  ❖ Mean, median, and mode can be calculated
  ❖ Examples: temperature, dates
❖ Ratio-Scaled Attributes
  ❖ Numeric attribute with an inherent zero-point
  ❖ Both differences and ratios can be calculated
  ❖ Mean, median, and mode can be calculated
  ❖ Examples: years_of_experience, number_of_words, weight, height
Discrete versus Continuous
❑ Discrete attribute – finite or countably infinite set of values
  ❑ Examples: number_of_students, drink_size, customer_id, zipcode
❑ Continuous attribute – real numbers, floating-point variables
  ❑ Example: height
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: is the data updated in a timely manner?
◦ Believability: how trustworthy is the data?
◦ Interpretability: how easily can the data be understood?
◦ Refer to Han-Kamber for more details
Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
◦ Integration of multiple databases, data cubes, or files
◦ Resolving inconsistencies (customer_id vs cust_id)
Data reduction – reduced volume, but the same analysis result
◦ Dimensionality reduction – wavelet transforms, PCA (see the sketch after this list)
◦ Numerosity reduction – log-linear models, clusters
◦ Data compression
Data transformation and data discretization
◦ Normalization, discretization
◦ Concept hierarchy generation
*The above categorization is not mutually exclusive: removal of redundant data is data cleaning as well as data reduction.
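The syllabus names PCA as a dimensionality-reduction technique but the slides do not detail it; as a hedged sketch (scikit-learn on made-up data, with an arbitrary choice of 10 attributes and 3 components):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 objects described by 10 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project the data onto its 3 directions of highest variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```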
Forms of Data Preprocessing
Data Cleaning
Data Cleaning
Data in the real world is dirty. Reasons: faulty instruments, human or computer error, transmission errors.
◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  ◦ e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
  ◦ e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names
  ◦ e.g., Age = “42” but Birthday = “03/07/2010”
  ◦ Rating was “1, 2, 3”, now rating is “A, B, C”
  ◦ Discrepancies between duplicate records
◦ Intentional (e.g., disguised missing data)
  ◦ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
▪ Data is not always available
  ▪ e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
▪ Missing data may be due to
  ▪ equipment malfunction
  ▪ values inconsistent with other recorded data, and thus deleted
  ▪ data not entered due to misunderstanding or privacy issues
  ▪ certain data not being considered important at the time of entry
▪ Missing data may need to be inferred
▪ Does a missing value always imply an error in the data? Justify.
How to Handle Missing Data?
❑ Ignore the tuple: usually done when the class label is missing (when doing classification) – not effective when the % of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious, and often infeasible
❑ Fill it in automatically with
  ◦ a global constant: e.g., “unknown” – effectively a new class?!
  ◦ the attribute mean or median
  ◦ the attribute mean for all samples belonging to the same class: smarter
  ◦ the most probable value: inference-based, such as a Bayesian formula or a decision tree that considers the other attributes
A sketch of two of the automatic strategies follows below.
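A minimal pandas sketch of the "attribute mean" and "per-class mean" fill strategies, on a hypothetical sales table with missing income values:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with missing customer income.
df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global fill: the attribute mean over all tuples.
df["income_global"] = df["income"].fillna(df["income"].mean())

# Smarter fill: the attribute mean over samples of the same class.
class_means = df.groupby("class")["income"].transform("mean")
df["income_by_class"] = df["income"].fillna(class_means)
print(df)
```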
Noisy Data
❑ What is noise? Random error or variance in a measured variable
❑ How do we identify noise? Boxplots, scatter plots, and other data-visualization methods
❑ Data smoothing techniques:
  ❑ Binning
  ❑ Regression
  ❑ Outlier analysis
Binning
➢ Binning methods smooth a sorted data value by consulting its neighborhood (local smoothing)
➢ Sorted values are distributed into a number of equal-frequency buckets (bins)
➢ Smoothing by bin means – each value in a bin is replaced by the bin's mean
➢ Smoothing by bin medians – each value in a bin is replaced by the bin's median
➢ Smoothing by bin boundaries – the minimum and maximum values in a given bin are identified as the bin boundaries; each value in the bin is replaced by the closer boundary value
➢ Note: the larger the bin width, the greater the effect of the smoothing
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
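A small numpy sketch reproducing this worked example (the depth of 4 is taken from the example above; the boundary tie-breaking rule toward the maximum is an assumption):

```python
import numpy as np

# Price data from the worked example above.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into equal-frequency bins of depth 4.
bins = prices.reshape(-1, 4)

# Smoothing by bin means: replace every value by its bin's (rounded) mean.
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4)
print(by_means)            # [ 9  9  9  9 23 23 23 23 29 29 29 29]

# Smoothing by bin boundaries: snap each value to the closer of the
# bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(by_bounds.ravel())   # [ 4  4  4 15 21 21 25 25 26 26 26 34]
```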
Regression
➢ A technique that fits data values to a function
➢ Linear regression involves finding the best line to fit two attributes (variables), so that one can be used to predict the other
➢ Example: using years of experience to predict salary
Outlier Analysis
❑ Outliers can be detected by clustering
❑ Outlier detection (anomaly detection) is the process of finding data objects whose behavior is very different from expectations
❑ Applications: fraud detection, security, image processing, video analysis, intrusion detection
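The slide names clustering as a detector; as a simpler boxplot-style alternative (not the slide's method), here is a hedged sketch of the 1.5 × IQR rule on made-up values:

```python
import numpy as np

# Made-up measurements with one suspicious value.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 204])

# Boxplot-style rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # [204]
```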
Discussion: Is a concept hierarchy a form of data discretization? Can it be used for data smoothing?
Tools for Discrepancy Detection
❑ Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data
❑ Data auditing tools analyze the data to discover rules and relationships, and to detect data that violate such conditions
❑ Potter’s Wheel is a publicly available data-cleaning tool that performs discrepancy detection and transformation
Data Integration
Data Integration
❑ Merging of data from multiple data stores
❑ Problems – redundancies and inconsistencies
❑ Challenges – matching schemas and objects from different sources
Entity Identification Problem
❑ The problem of matching equivalent real-world entities from multiple data sources
❑ How can a data analyst be sure that customer_id in one database and cust_number in another database refer to the same attribute?
❑ Metadata can help to avoid data integration issues
  ❑ Metadata for each attribute includes its name, meaning, data type, permitted range of values, and null rules
❑ Functional dependencies and referential constraints should be taken care of during data integration
Handling Redundancy in Data Integration
❖ Redundant data occur often during the integration of multiple databases
  ❖ Object identification: the same attribute or object may have different names in different databases
  ❖ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
❖ Redundant attributes may be detected by the chi-square (χ²) correlation test for nominal data, or by the correlation coefficient or covariance analysis for numeric data, as sketched below
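A hedged sketch of all three checks; the attribute names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical nominal attributes; the chi-square test on their
# contingency table checks whether they are correlated (redundant).
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "preferred_reading": ["fiction", "fiction", "non_fiction", "fiction",
                          "non_fiction", "fiction", "non_fiction", "fiction"],
})
table = pd.crosstab(df["gender"], df["preferred_reading"])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # a small p-value suggests the attributes are correlated

# Hypothetical numeric attributes: correlation coefficient and covariance.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 4.2, 5.9, 8.1, 9.8])
print(np.corrcoef(a, b)[0, 1])   # close to +1 -> likely redundant
print(np.cov(a, b)[0, 1])        # covariance of the two attributes
```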
Other Problems in Integration
❑ Tuple duplication – redundancy at the tuple level
  ❑ Denormalization is one cause of such redundancy
❑ Data value conflict detection – a ‘weight’ attribute may be stored in different measurement systems in different databases
  ❑ Currencies and tax calculation rules differ between countries
Attribute Subset Selection
▪ In multi-dimensional data, some attributes may be irrelevant to the data mining task
▪ Example – if the task is to classify customers based on whether or not they are likely to purchase a popular new CD at the store:
  ▪ Relevant attributes – age, music_taste
  ▪ Irrelevant attributes – telephone number
▪ A domain expert can pick out the relevant attributes, but this is time-consuming
▪ Attribute subset selection (feature subset selection in ML) reduces the data set size by removing irrelevant attributes
Finding a Good Subset
▪ For n attributes, there are 2^n possible subsets
▪ Heuristic (greedy) methods are therefore used for attribute subset selection
▪ These methods make the locally optimal choice at each step, hoping that this will lead to a globally optimal solution
▪ The best attributes are decided by measures such as information gain (see the sketch below)
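A hedged sketch of greedy forward selection using scikit-learn on synthetic data; the estimator, dataset, and subset size are all illustrative assumptions, not the course's prescribed method. The decision tree is configured with criterion="entropy" so its splits use the information-gain measure the slide mentions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 attributes, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Greedy forward selection: at each step, add the attribute that most
# improves the model (a locally optimal choice), hoping to approach
# the globally optimal subset.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
selector = SequentialFeatureSelector(tree, n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask over the 10 attributes
```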
Sampling
❖ A data reduction technique
❖ Allows a large dataset to be represented by a much smaller data sample
❖ Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
❖ Key principle: choose a representative subset of the data
Types of Sampling
• Sampling without replacement
  • Once an object is selected, it is removed from the population
• Sampling with replacement
  • A selected object is not removed from the population, so it may be drawn again
• Cluster sampling
  • If the tuples in D are grouped into M disjoint clusters, a sample can be obtained from each cluster
• Stratified sampling
  • Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  • Used in conjunction with skewed data
  • Example – creating a stratum for each age group (see the sketch below)
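A minimal pandas sketch of three of these schemes; the age_group strata and sample sizes are made-up assumptions:

```python
import pandas as pd

# Hypothetical data set with a skewed age distribution.
df = pd.DataFrame({
    "age_group": ["young"] * 6 + ["middle"] * 3 + ["senior"] * 1,
    "income": range(10),
})

# Simple random sampling without replacement:
# once drawn, a tuple cannot be drawn again.
print(df.sample(n=4, replace=False, random_state=0))

# Simple random sampling with replacement:
# the same tuple may appear more than once.
print(df.sample(n=4, replace=True, random_state=0))

# Stratified sampling: draw roughly the same fraction from every
# age_group stratum (very small strata may contribute no rows).
print(df.groupby("age_group").sample(frac=0.5, random_state=0))
```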
Sampling: With or Without Replacement
[Figure: samples drawn from the raw data, with and without replacement]
Cluster or Stratified Sampling
[Figure: a cluster/stratified sample drawn from the raw data]
Data Transformation
Data Transformation
Data are transformed and consolidated so that the resulting mining process is more efficient.
Strategies for data transformation:
1. Smoothing
2. Attribute construction – attribute discovery
3. Aggregation
4. Normalization
5. Discretization
6. Concept hierarchy generation
Normalization
➢ Normalizing the data attempts to give all attributes an equal weight
➢ For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with smaller ranges (e.g., age)
➢ It removes dependence on measurement units
➢ Normalization transforms the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0]
➢ Normalization is useful for algorithms like neural networks, and for distance-based algorithms like nearest-neighbour classification and clustering
➢ Methods:
  ➢ Min-max normalization
  ➢ Z-score normalization
  ➢ Decimal scaling
Min-Max Normalization
• Let A be a numeric attribute (e.g., income) with n observed values
• Let min_A and max_A be the minimum and maximum values of A
• Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A]:

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Example: let an income range of $12,000 to $98,000 be normalized to [0.0, 1.0]
• Then $73,600 is mapped to

  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
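A numpy sketch of all three normalization methods from the list above; the income values are made up apart from the slide's endpoints and $73,600:

```python
import numpy as np

income = np.array([12_000.0, 47_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [new_min, new_max] = [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_min, v_max = income.min(), income.max()
minmax = (income - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
print(minmax)           # 73,600 maps to ~0.716, as in the worked example

# Z-score normalization: zero mean, unit standard deviation.
zscore = (income - income.mean()) / income.std()
print(zscore)

# Decimal scaling: divide by 10^j, the smallest power of 10
# that makes max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)   # 98,000 -> 0.98 with j = 5
```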
Discretization
Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace the actual data values
◦ Reduces the data size
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepares the data for further analysis, e.g., classification
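A short pandas sketch of two unsupervised discretization schemes; the age values and interval labels are made up:

```python
import pandas as pd

# Hypothetical continuous attribute: ages.
ages = pd.Series([13, 22, 25, 31, 40, 47, 55, 62, 70])

# Equal-width discretization: 3 intervals of equal range, with
# interval labels replacing the actual values.
print(pd.cut(ages, bins=3, labels=["young", "middle", "senior"]).tolist())

# Equal-frequency discretization: 3 intervals holding roughly the
# same number of values each.
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]).tolist())
```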
Extra
Data Wrangling
❖ Data wrangling is the process of converting and mapping data from its raw form to another format, with the purpose of making it more valuable and appropriate for advanced tasks such as data analytics and machine learning.
❖ Differences between data wrangling and ETL:
  ❖ Users – business analysts vs. IT employees
  ❖ Data – diverse and complex vs. well structured
  ❖ Use cases – exploratory vs. reporting & analysis
❖ Yet data wrangling and ETL are complementary