Data Mining
What Is an Attribute?
An attribute is a data field that represents a characteristic or feature of a data object. Attributes are classified by the type of values they can take.
Nominal Attributes:
This type of data consists of names or category labels with no inherent order, such as hair colour, occupation, or product type. Only equality comparisons between nominal values are meaningful. Nominal data is used in data mining for classification and association tasks.
Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy: the values can be ranked, even though the distances between them are not defined. For instance, education level can be ranked from primary to tertiary, and social status from low to high. Because the distance between values is not uniform, it is not possible to say that the difference between high and medium social status is the same as the difference between medium and low social status. Ordinal data is used in data mining for ranking and classification tasks.
Binary Data:
This type of data has only two possible values, often represented as 0 or
1. Binary data is commonly used in classification tasks, where the target
variable has only two possible outcomes. Examples of binary data include
yes/no, true/false, and pass/fail. Binary data is used in data mining for
classification and association rule mining tasks.
Interval Data:
This type of data represents quantitative values with equal intervals between consecutive values. Interval data has no absolute zero point, so ratios cannot be computed: 20 °C is not "twice as hot" as 10 °C. Examples of interval data include temperature in Celsius or Fahrenheit, IQ scores, and calendar dates. Interval data is used in data mining for clustering and prediction tasks.
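The distinctions above can be sketched in code: ordinal and binary labels map naturally to numbers, while nominal labels do not. A minimal sketch, where the label sets are illustrative:

```python
# Ordinal: education level has an inherent order, so map labels to ranks.
education_order = {"primary": 0, "secondary": 1, "tertiary": 2}

def encode_ordinal(values, order):
    """Replace ordinal labels with their rank in the given order."""
    return [order[v] for v in values]

# Binary: two possible outcomes map naturally to 0/1.
def encode_binary(values, positive="yes"):
    """Map the 'positive' label to 1 and everything else to 0."""
    return [1 if v == positive else 0 for v in values]

levels = ["secondary", "primary", "tertiary"]
answers = ["yes", "no", "yes"]

print(encode_ordinal(levels, education_order))  # [1, 0, 2]
print(encode_binary(answers))                   # [1, 0, 1]
```

Note that a nominal attribute such as hair colour has no defensible rank order, which is why it is usually encoded differently (for example, as one indicator column per category).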
1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can result
from various factors, such as errors during data entry or data loss during
transmission. Preprocessing techniques, such as imputation, can be used to fill in
missing values to ensure the completeness of the dataset.
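Mean imputation, one of the simplest such techniques, can be sketched as follows (`None` marks a missing entry; the sample ages are illustrative):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

ages = [25, None, 35, 40, None]
print(impute_mean(ages))  # both None entries become the mean of 25, 35, 40
```

In practice the imputation strategy (mean, median, mode, or a model-based estimate) is chosen to match the attribute type and the missing-data mechanism.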
2. Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent data can
result from errors in data entry, data integration, or data storage. Preprocessing
techniques, such as data cleaning and data integration, can be used to detect and
resolve inconsistencies in the dataset.
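One common resolution strategy is a majority vote across sources. A minimal sketch with hypothetical customer records:

```python
from collections import Counter

def resolve_conflicts(records):
    """Keep the most frequently reported value for each key."""
    by_key = {}
    for key, value in records:
        by_key.setdefault(key, []).append(value)
    return {k: Counter(vs).most_common(1)[0][0] for k, vs in by_key.items()}

records = [("cust1", "London"), ("cust1", "Londn"),
           ("cust1", "London"), ("cust2", "Paris")]
print(resolve_conflicts(records))  # {'cust1': 'London', 'cust2': 'Paris'}
```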
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors
during data collection or data entry. Preprocessing techniques, such as data
smoothing and outlier detection, can be used to remove noise from the dataset.
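Smoothing by bin means, a standard technique of this kind, can be sketched as follows (the bin size and sample values are illustrative):

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, partition into equal-size bins, and replace each
    value by the mean of its bin, damping random noise."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(data, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```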
4. Outliers:
Outliers are data points that are significantly different from the other data points in
the dataset. Outliers can result from errors in data collection, data entry, or data
transmission. Preprocessing techniques, such as outlier detection and removal, can
be used to identify and remove outliers from the dataset.
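A minimal sketch of outlier removal using the common 1.5 × IQR rule (sample values are illustrative):

```python
import statistics

def remove_outliers_iqr(values):
    """Drop points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if lo <= x <= hi]

data = [10, 12, 11, 13, 12, 100]
print(remove_outliers_iqr(data))  # the extreme value 100 is dropped
```

Whether to remove or merely flag outliers depends on the application; an "outlier" may be an error, or it may be the most interesting record in the dataset.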
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset.
Redundant data can result from data integration or data storage. Preprocessing
techniques, such as data deduplication, can be used to remove redundant data from
the dataset.
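Exact-duplicate removal can be sketched as follows (the records are illustrative):

```python
def deduplicate(records):
    """Remove exact duplicate records, keeping the first occurrence
    of each and preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

rows = [("alice", 30), ("bob", 25), ("alice", 30)]
print(deduplicate(rows))  # [('alice', 30), ('bob', 25)]
```

Real deduplication often also has to catch near-duplicates (typos, formatting differences), which requires fuzzy matching rather than exact comparison.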
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in
different formats, such as text, numerical, or categorical. Preprocessing techniques,
such as data transformation and normalization, can be used to convert data into a
consistent format for analysis.
Regression:
Data can be smoothed by fitting it to a regression function; the fitted values then replace the noisy observations. The regression used may be linear (having one independent variable) or multiple (having several independent variables).
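The linear case can be sketched with an ordinary least-squares fit (the sample points are illustrative):

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def smooth_by_regression(xs, ys):
    """Replace each y with its fitted value on the regression line."""
    a, b = fit_line(xs, ys)
    return [a + b * x for x in xs]

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
print(smooth_by_regression(xs, ys))  # noisy ys snapped onto the line
```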
Clustering:
This approach groups similar data points into clusters. Values that fall outside every cluster can then be treated as outliers, although some outliers may still go undetected.
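A minimal one-dimensional sketch of this idea, with an illustrative gap threshold: neighbouring values are grouped into the same cluster when they are close, and values left in very small clusters are flagged as outliers.

```python
def cluster_1d(values, gap=5):
    """Group sorted values: start a new cluster whenever the gap to
    the previous value exceeds the threshold."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for x in ordered[1:]:
        if x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters

def outliers(values, gap=5, min_size=2):
    """Treat members of clusters smaller than min_size as outliers."""
    return [x for c in cluster_1d(values, gap) if len(c) < min_size for x in c]

data = [10, 11, 12, 13, 50, 51, 52, 90]
print(cluster_1d(data))  # [[10, 11, 12, 13], [50, 51, 52], [90]]
print(outliers(data))    # [90]
```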
Data Integration: This involves combining data from multiple sources
to create a unified dataset. Data integration can be challenging as it
requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be
used for data integration.
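A minimal sketch of integration by key: two hypothetical sources keyed on the same customer id are merged into one unified record per key (the field names are illustrative).

```python
def integrate(source_a, source_b):
    """Join two dicts of records on their shared keys, merging the
    fields of both sources into one record per key."""
    merged = {}
    for key in set(source_a) | set(source_b):
        record = {}
        record.update(source_a.get(key, {}))
        record.update(source_b.get(key, {}))
        merged[key] = record
    return merged

crm = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}
billing = {"c1": {"balance": 120.0}}
print(integrate(crm, billing))
```

Real record linkage is harder than this sketch: the sources rarely share a clean common key, so approximate matching on names, addresses, or other fields is usually needed first.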
Data Transformation: This involves converting the data into a
suitable format for analysis. Common techniques used in data
transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero
mean and unit variance. Discretization is used to convert continuous
data into discrete categories.
Data transformation involves the following strategies:
Normalization:
This scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes, and irrelevant ones are dropped, to help the mining process.
Discretization:
This replaces the raw values of a numeric attribute with interval levels or conceptual levels.
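The numeric strategies above (normalization and discretization, plus the standardization mentioned earlier) can be sketched as follows; the interval labels and sample values are illustrative.

```python
import statistics

def min_max_normalize(values):
    """Scale values into the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Transform values to zero mean and unit variance (z-scores),
    using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

def discretize(value):
    """Replace a normalized value with a conceptual level."""
    if value < 1 / 3:
        return "low"
    if value < 2 / 3:
        return "medium"
    return "high"

temps = [10, 20, 30, 40]
norm = min_max_normalize(temps)
print(norm)                            # values scaled into [0.0, 1.0]
print([discretize(v) for v in norm])
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))  # zero mean, unit variance
```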
Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important
information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data
reduction are: