
DATA MINING

What Is an Attribute?

An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute, and we do here as well.

Nominal Attributes:

This type of data is also referred to as categorical data. Nominal data represents data that is qualitative and cannot be measured or compared with numbers. In nominal data, the values represent categories, and there is no inherent order or hierarchy. Examples of nominal data include gender, race, religion, and occupation. Nominal data is used in data mining for classification and clustering tasks.

Ordinal Data:
This type of data is also categorical, but with an inherent order or
hierarchy. Ordinal data represents qualitative data that can be ranked in a
particular order. For instance, education level can be ranked from primary
to tertiary, and social status can be ranked from low to high. In ordinal
data, the distance between values is not uniform. This means that it is not
possible to say that the difference between high and medium social status
is the same as the difference between medium and low social status.
Ordinal data is used in data mining for ranking and classification tasks.
Binary Data:
This type of data has only two possible values, often represented as 0 or
1. Binary data is commonly used in classification tasks, where the target
variable has only two possible outcomes. Examples of binary data include
yes/no, true/false, and pass/fail. Binary data is used in data mining for
classification and association rule mining tasks.
Interval Data:
This type of data represents quantitative data with equal intervals between
consecutive values. Interval data has no absolute zero point, and
therefore, ratios cannot be computed. Examples of interval data include temperature in degrees Celsius or Fahrenheit, IQ scores, and calendar dates. Interval data is used in data mining for clustering and prediction tasks.
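
As a rough illustration, the small table below (a pandas sketch with made-up columns and values) shows how the four attribute types might appear side by side:

import pandas as pd

# Hypothetical records illustrating the four attribute types.
df = pd.DataFrame({
    "occupation": ["teacher", "engineer", "nurse"],      # nominal: unordered categories
    "education": ["primary", "secondary", "tertiary"],   # ordinal: ordered categories
    "passed": [1, 0, 1],                                  # binary: only two possible values
    "temp_c": [21.5, 19.0, 23.2],                         # interval: equal spacing, no true zero
})

# Give the ordinal column an explicit order so comparisons are meaningful.
df["education"] = pd.Categorical(
    df["education"], categories=["primary", "secondary", "tertiary"], ordered=True
)
print(df.dtypes)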

Why do we preprocess the data?

Data preprocessing is an essential step in data mining and machine learning, as it helps to ensure the quality of the data used for analysis. Several factors are used for data quality assessment, including:

1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can result
from various factors, such as errors during data entry or data loss during
transmission. Preprocessing techniques, such as imputation, can be used to fill in
missing values to ensure the completeness of the dataset.
2. Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent data can
result from errors in data entry, data integration, or data storage. Preprocessing
techniques, such as data cleaning and data integration, can be used to detect and
resolve inconsistencies in the dataset.
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors
during data collection or data entry. Preprocessing techniques, such as data
smoothing and outlier detection, can be used to remove noise from the dataset.
4. Outliers:
Outliers are data points that are significantly different from the other data points in
the dataset. Outliers can result from errors in data collection, data entry, or data
transmission. Preprocessing techniques, such as outlier detection and removal, can
be used to identify and remove outliers from the dataset.
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset.
Redundant data can result from data integration or data storage. Preprocessing
techniques, such as data deduplication, can be used to remove redundant data from
the dataset.
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in
different formats, such as text, numerical, or categorical. Preprocessing techniques,
such as data transformation and normalization, can be used to convert data into a
consistent format for analysis.
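
As a rough, non-authoritative sketch, the pandas snippet below runs a few of these quality checks; the file name data.csv and its contents are hypothetical:

import pandas as pd

df = pd.read_csv("data.csv")            # hypothetical input file

# Incompleteness: count missing values per column.
print(df.isna().sum())

# Redundancy: count exact duplicate rows.
print(df.duplicated().sum())

# Outliers: flag numeric values lying outside 1.5 * IQR.
num = df.select_dtypes("number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
print(((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum())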

Some common steps in data preprocessing include:

 Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
1. Missing Data:
This situation arises when some values are missing from the dataset. It can be handled in various ways. Some of them are:
 Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

 Fill the Missing values:
There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value, as in the sketch below.
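
A minimal pandas sketch of both options, assuming a hypothetical file data.csv with a mix of numeric and categorical columns:

import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical dataset

# Ignore the tuples: drop rows that are missing more than two values.
df_dropped = df.dropna(thresh=len(df.columns) - 2)

# Fill the missing values: numeric columns with the attribute mean,
# other columns with the most probable (most frequent) value.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])   # assumes the column is not entirely empty
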
2. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
 Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. One can replace all the data in a segment by its mean, or bin boundary values can be used to complete the task, as sketched below.
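
A minimal NumPy sketch of both variants, using a small made-up list of sorted values split into three equal-size bins:

import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 3)                 # equal-size (equal-frequency) segments

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value moves to the nearer bin edge.
by_bounds = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])
print(by_means)
print(by_bounds)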

 Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
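
A minimal scikit-learn sketch of linear-regression smoothing on made-up, noisy one-dimensional data (for multiple regression, x would simply have several columns):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20).reshape(-1, 1)                  # one independent variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 4, 20)  # hypothetical noisy measurements

# Fit the regression and replace the noisy values with the fitted line.
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)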

 Clustering:
This approach groups similar data points into clusters. Outliers may go undetected, or they will fall outside the clusters.
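
One way to sketch this is with DBSCAN, a density-based clustering method in scikit-learn that labels points falling outside every cluster as -1; the data below is synthetic:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),        # dense group 1
    rng.normal(5, 0.3, size=(50, 2)),        # dense group 2
    [[10.0, 10.0], [-8.0, 7.0]],             # far-away points
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = X[labels == -1]                   # points outside every cluster
print(len(outliers))
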
 Data Integration: This involves combining data from multiple sources
to create a unified dataset. Data integration can be challenging as it
requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be
used for data integration.
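
A minimal pandas sketch of record linkage on a shared key followed by deduplication; the file names and the customer_id column are hypothetical:

import pandas as pd

customers = pd.read_csv("crm_customers.csv")   # hypothetical source 1
orders = pd.read_csv("web_orders.csv")         # hypothetical source 2

# Link records from the two sources on a shared key, then drop exact duplicates.
merged = customers.merge(orders, on="customer_id", how="inner")
merged = merged.drop_duplicates()
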
 Data Transformation: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero
mean and unit variance. Discretization is used to convert continuous
data into discrete categories.
Data transformation involves the following ways:
 Normalization:
It is done in order to scale the data values to a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
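
A minimal sketch of min-max normalization to the range 0.0 to 1.0, using a made-up income column:

import pandas as pd

df = pd.DataFrame({"income": [32000, 45000, 58000, 120000]})   # hypothetical values

col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())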

 Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
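
A minimal sketch of constructing a new attribute from existing ones; the columns and the derived bmi attribute are purely illustrative:

import pandas as pd

df = pd.DataFrame({"height_m": [1.70, 1.82], "weight_kg": [68, 90]})   # hypothetical

# Construct a new attribute from the given attributes to help the mining process.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2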

 Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
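
A minimal pandas sketch that replaces raw ages with conceptual levels; the column, cut points, and labels are made up:

import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 67]})   # hypothetical values

df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 19, 64, 120],
    labels=["child", "teen", "adult", "senior"],
)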

 Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
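
A minimal sketch of climbing such a hierarchy by mapping a lower-level attribute to a higher-level one; the cities and the mapping are made up:

import pandas as pd

df = pd.DataFrame({"city": ["Lahore", "Karachi", "Istanbul"]})   # hypothetical

city_to_country = {"Lahore": "Pakistan", "Karachi": "Pakistan", "Istanbul": "Turkey"}
df["country"] = df["city"].map(city_to_country)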

 Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important
information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data
reduction are:

 Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
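
A minimal sketch of correlation analysis plus mutual-information selection with pandas and scikit-learn; data.csv, the target column, and the thresholds are hypothetical, and all features are assumed numeric:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.read_csv("data.csv")                       # hypothetical dataset
X, y = df.drop(columns="target"), df["target"]     # hypothetical class column

# Correlation analysis: drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# Mutual information: keep the 5 features most informative about the class.
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)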

 Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
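
A minimal PCA sketch with scikit-learn on synthetic data; standardizing first is a common, though not mandatory, choice:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # hypothetical 10-dimensional data

# Standardize, then project onto the 3 directions of largest variance.
X_std = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_std)
print(X_reduced.shape)                             # (200, 3)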

 Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of the dataset while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.
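
A minimal sketch of the three sampling schemes with pandas and scikit-learn; data.csv and the label column used for stratification are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                       # hypothetical dataset

# Simple random sampling: keep 10% of the rows.
sample_random = df.sample(frac=0.10, random_state=0)

# Stratified sampling: keep 10% of the rows while preserving class proportions.
sample_stratified, _ = train_test_split(
    df, train_size=0.10, stratify=df["label"], random_state=0
)

# Systematic sampling: keep every 10th row.
sample_systematic = df.iloc[::10]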

 Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
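
A minimal k-means sketch in which every point is replaced by the centroid of its cluster, so a synthetic 1000-row dataset is summarized by 20 representatives:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                     # hypothetical numeric data

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
X_reduced = kmeans.cluster_centers_[kmeans.labels_]   # each row replaced by its centroid
centroids = kmeans.cluster_centers_                   # the 20 representative points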

 Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
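
A minimal sketch of gzip compression for storage or transmission, using Python's standard library; the file names are hypothetical:

import gzip
import shutil

# Compress a CSV file with gzip.
with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# pandas can also write gzip-compressed files directly, e.g.
# df.to_csv("data.csv.gz", index=False, compression="gzip")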
