
DATA MINING

What Is an Attribute?

An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute, and we do here as well.

Nominal Attributes:

This type of data is also referred to as categorical data. Nominal data represents data that is qualitative and cannot be measured or compared with numbers. In nominal data, the values represent categories, and there is no inherent order or hierarchy. Examples of nominal data include gender, race, religion, and occupation. Nominal data is used in data mining for classification and clustering tasks.

Ordinal Data:
This type of data is also categorical, but with an inherent order or
hierarchy. Ordinal data represents qualitative data that can be ranked in a
particular order. For instance, education level can be ranked from primary
to tertiary, and social status can be ranked from low to high. In ordinal
data, the distance between values is not uniform. This means that it is not
possible to say that the difference between high and medium social status
is the same as the difference between medium and low social status.
Ordinal data is used in data mining for ranking and classification tasks.
Binary Data:
This type of data has only two possible values, often represented as 0 or
1. Binary data is commonly used in classification tasks, where the target
variable has only two possible outcomes. Examples of binary data include
yes/no, true/false, and pass/fail. Binary data is used in data mining for
classification and association rule mining tasks.
Interval Data:
This type of data represents quantitative data with equal intervals between
consecutive values. Interval data has no absolute zero point, and
therefore, ratios cannot be computed. Examples of interval data include temperature in degrees Celsius or Fahrenheit, IQ scores, and calendar dates. Interval data is used in data mining for clustering and prediction tasks.
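
As a rough illustration, the small table below (a pandas sketch with made-up columns and values) shows how the four attribute types might appear side by side:

import pandas as pd

# Hypothetical records illustrating the four attribute types.
df = pd.DataFrame({
    "occupation": ["teacher", "engineer", "nurse"],      # nominal: unordered categories
    "education": ["primary", "secondary", "tertiary"],   # ordinal: ordered categories
    "passed": [1, 0, 1],                                  # binary: only two possible values
    "temp_c": [21.5, 19.0, 23.2],                         # interval: equal spacing, no true zero
})

# Give the ordinal column an explicit order so comparisons are meaningful.
df["education"] = pd.Categorical(
    df["education"], categories=["primary", "secondary", "tertiary"], ordered=True
)
print(df.dtypes)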

Why do we preprocess the data?

Data preprocessing is an essential step in data mining and machine learning, as it helps to ensure the quality of the data used for analysis. Several factors are used for data quality assessment, including:

1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can result
from various factors, such as errors during data entry or data loss during
transmission. Preprocessing techniques, such as imputation, can be used to fill in
missing values to ensure the completeness of the dataset.
2. Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent data can
result from errors in data entry, data integration, or data storage. Preprocessing
techniques, such as data cleaning and data integration, can be used to detect and
resolve inconsistencies in the dataset.
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors
during data collection or data entry. Preprocessing techniques, such as data
smoothing and outlier detection, can be used to remove noise from the dataset.
4. Outliers:
Outliers are data points that are significantly different from the other data points in
the dataset. Outliers can result from errors in data collection, data entry, or data
transmission. Preprocessing techniques, such as outlier detection and removal, can
be used to identify and remove outliers from the dataset.
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset.
Redundant data can result from data integration or data storage. Preprocessing
techniques, such as data deduplication, can be used to remove redundant data from
the dataset.
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in
different formats, such as text, numerical, or categorical. Preprocessing techniques,
such as data transformation and normalization, can be used to convert data into a
consistent format for analysis.
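
As a rough, non-authoritative sketch, the pandas snippet below runs a few of these quality checks; the file name data.csv and its contents are hypothetical:

import pandas as pd

df = pd.read_csv("data.csv")            # hypothetical input file

# Incompleteness: count missing values per column.
print(df.isna().sum())

# Redundancy: count exact duplicate rows.
print(df.duplicated().sum())

# Outliers: flag numeric values lying outside 1.5 * IQR.
num = df.select_dtypes("number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
print(((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum())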

Some common steps in data preprocessing include:

 Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
1. Missing Data:
This situation arises when some values are missing from the dataset. It can be handled in various ways. Some of them are:
 Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

 Fill the Missing values:
There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value, as in the sketch below.
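
A minimal pandas sketch of both options, assuming a hypothetical file data.csv with a mix of numeric and categorical columns:

import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical dataset

# Ignore the tuples: drop rows that are missing more than two values.
df_dropped = df.dropna(thresh=len(df.columns) - 2)

# Fill the missing values: numeric columns with the attribute mean,
# other columns with the most probable (most frequent) value.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])   # assumes the column is not entirely empty
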
2. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
 Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. One can replace all the data in a segment by its mean, or bin boundary values can be used to complete the task, as sketched below.
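
A minimal NumPy sketch of both variants, using a small made-up list of sorted values split into three equal-size bins:

import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 3)                 # equal-size (equal-frequency) segments

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value moves to the nearer bin edge.
by_bounds = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])
print(by_means)
print(by_bounds)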

 Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
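
A minimal scikit-learn sketch of linear-regression smoothing on made-up, noisy one-dimensional data (for multiple regression, x would simply have several columns):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20).reshape(-1, 1)                  # one independent variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 4, 20)  # hypothetical noisy measurements

# Fit the regression and replace the noisy values with the fitted line.
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)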

 Clustering:
This approach groups similar data points into clusters. Outliers may go undetected, or they will fall outside the clusters.
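
One way to sketch this is with DBSCAN, a density-based clustering method in scikit-learn that labels points falling outside every cluster as -1; the data below is synthetic:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),        # dense group 1
    rng.normal(5, 0.3, size=(50, 2)),        # dense group 2
    [[10.0, 10.0], [-8.0, 7.0]],             # far-away points
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = X[labels == -1]                   # points outside every cluster
print(len(outliers))
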
 Data Integration: This involves combining data from multiple sources
to create a unified dataset. Data integration can be challenging as it
requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be
used for data integration.
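
A minimal pandas sketch of record linkage on a shared key followed by deduplication; the file names and the customer_id column are hypothetical:

import pandas as pd

customers = pd.read_csv("crm_customers.csv")   # hypothetical source 1
orders = pd.read_csv("web_orders.csv")         # hypothetical source 2

# Link records from the two sources on a shared key, then drop exact duplicates.
merged = customers.merge(orders, on="customer_id", how="inner")
merged = merged.drop_duplicates()
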
 Data Transformation: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero
mean and unit variance. Discretization is used to convert continuous
data into discrete categories.
Data transformation involves the following ways:
 Normalization:
It is done in order to scale the data values to a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
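
A minimal sketch of min-max normalization to the range 0.0 to 1.0, using a made-up income column:

import pandas as pd

df = pd.DataFrame({"income": [32000, 45000, 58000, 120000]})   # hypothetical values

col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())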

 Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
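
A minimal sketch of constructing a new attribute from existing ones; the columns and the derived bmi attribute are purely illustrative:

import pandas as pd

df = pd.DataFrame({"height_m": [1.70, 1.82], "weight_kg": [68, 90]})   # hypothetical

# Construct a new attribute from the given attributes to help the mining process.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2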

 Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
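
A minimal pandas sketch that replaces raw ages with conceptual levels; the column, cut points, and labels are made up:

import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 67]})   # hypothetical values

df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 19, 64, 120],
    labels=["child", "teen", "adult", "senior"],
)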

 Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
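
A minimal sketch of climbing such a hierarchy by mapping a lower-level attribute to a higher-level one; the cities and the mapping are made up:

import pandas as pd

df = pd.DataFrame({"city": ["Lahore", "Karachi", "Istanbul"]})   # hypothetical

city_to_country = {"Lahore": "Pakistan", "Karachi": "Pakistan", "Istanbul": "Turkey"}
df["country"] = df["city"].map(city_to_country)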

 Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important
information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data
reduction are:

 Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
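
A minimal sketch of correlation analysis plus mutual-information selection with pandas and scikit-learn; data.csv, the target column, and the thresholds are hypothetical, and all features are assumed numeric:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.read_csv("data.csv")                       # hypothetical dataset
X, y = df.drop(columns="target"), df["target"]     # hypothetical class column

# Correlation analysis: drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# Mutual information: keep the 5 features most informative about the class.
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)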

 Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
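
A minimal PCA sketch with scikit-learn on synthetic data; standardizing first is a common, though not mandatory, choice:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # hypothetical 10-dimensional data

# Standardize, then project onto the 3 directions of largest variance.
X_std = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_std)
print(X_reduced.shape)                             # (200, 3)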

 Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of the dataset while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.
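
A minimal sketch of the three sampling schemes with pandas and scikit-learn; data.csv and the label column used for stratification are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                       # hypothetical dataset

# Simple random sampling: keep 10% of the rows.
sample_random = df.sample(frac=0.10, random_state=0)

# Stratified sampling: keep 10% of the rows while preserving class proportions.
sample_stratified, _ = train_test_split(
    df, train_size=0.10, stratify=df["label"], random_state=0
)

# Systematic sampling: keep every 10th row.
sample_systematic = df.iloc[::10]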

 Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
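
A minimal k-means sketch in which every point is replaced by the centroid of its cluster, so a synthetic 1000-row dataset is summarized by 20 representatives:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                     # hypothetical numeric data

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
X_reduced = kmeans.cluster_centers_[kmeans.labels_]   # each row replaced by its centroid
centroids = kmeans.cluster_centers_                   # the 20 representative points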

 Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
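
A minimal sketch of gzip compression for storage or transmission, using Python's standard library; the file names are hypothetical:

import gzip
import shutil

# Compress a CSV file with gzip.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# pandas can also write gzip-compressed files directly, e.g.
# df.to_csv("data.csv.gz", index=False, compression="gzip")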
