Data Mining Unit 1
Data Mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful insights that allow a business to take data-driven decisions.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures.
Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery from Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective in mind.
This process includes various types of services such as text mining, web mining, audio
and video mining, pictorial data mining, and social media mining.
Enormous amounts of information are available on various platforms, but very little of it is usable knowledge. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or to develop the company.
There are many powerful tools and techniques available to mine data and derive better insights from it.
Kinds of Data
Data Mining can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.
Besides structured data, there are other kinds of data, such as semi-structured or unstructured data, including spatial data, multimedia data, text data, and web data, which require different data mining methodologies.
Mining Multimedia Data: Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages. Multimedia data mining tries to find interesting patterns in multimedia databases. This includes processing digital data and performing tasks like image processing, image classification, video and audio data mining, and pattern recognition.
Mining Web Data: Web mining is essential to discover crucial patterns and knowledge from the Web. Web content mining analyzes the data of several websites, including the web pages and the multimedia data (such as images) contained in them. Web mining is done to understand the content of web pages, the unique users of a website, unique hypertext links, web page relevance and ranking, web page content summaries, the time that users spend on a particular website, and user search patterns. Web mining also helps in evaluating which search engine serves users best.
Mining Text Data: Text mining is a subfield of data mining, machine learning, natural language processing, and statistics. Most of the information in our daily life is stored as text, such as news articles, technical papers, books, email messages, and blogs. Text mining helps us retrieve high-quality information from text through tasks such as sentiment analysis, document summarization, text categorization, and text clustering. We apply machine learning models and NLP techniques to derive useful information from the text.
Mining Spatiotemporal Data: The data that is related to both space and time is
Spatiotemporal data. Spatiotemporal data mining retrieves interesting patterns and
knowledge from spatiotemporal data. Spatiotemporal data mining helps us to estimate the value of land, determine the age of rocks and precious stones, and predict weather patterns. It has many practical applications, such as GPS in mobile phones, timers, Internet-based map services, weather services, satellites, RFID tags, and sensors.
Mining Data Streams: Stream data is data that changes dynamically; it is noisy and inconsistent and contains multidimensional features of different data types, so it is usually stored in NoSQL database systems. The volume of stream data is very high, which is the main challenge for effective mining of stream data. While mining data streams we need to perform tasks such as clustering, outlier analysis, and the online detection of rare events in data streams.
Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates both analysis and prediction.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction, sequential
patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It helps classify data into different classes.
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example discrimination, classification, clustering, characterization, etc. Some frameworks are extensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc. The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
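As a minimal, hypothetical sketch of the classification technique itself (not of the framework categories above), the following example assumes scikit-learn is available and trains a small decision tree on invented customer data:

```python
# A minimal classification sketch using a decision tree (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, income] -> class label ("yes"/"no" = will the customer buy?).
X_train = [[25, 30000], [47, 62000], [52, 85000], [23, 20000], [36, 50000]]
y_train = ["no", "yes", "yes", "no", "yes"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Classify a new, unseen data object.
print(model.predict([[30, 40000]]))   # prints the predicted class for this customer
```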
2. Clustering:
Clustering is a division of data into groups of similar objects, so that objects within a group are more similar to each other than to objects in other groups. Distance-based similarity measures used in clustering are discussed later in this unit.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to predict the value of a particular variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability,
consumer demand, and competition. Primarily it gives the exact relationship between two or
more variables in the given data set.
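A minimal regression sketch in Python (assuming scikit-learn is available; the demand and cost figures are invented for illustration):

```python
# Simple linear regression: project cost from consumer demand (toy numbers).
import numpy as np
from sklearn.linear_model import LinearRegression

demand = np.array([[10], [20], [30], [40], [50]])   # independent variable
cost = np.array([120, 190, 270, 330, 410])          # dependent variable

reg = LinearRegression().fit(demand, cost)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("projected cost at demand=60:", reg.predict([[60]])[0])
```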
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you take a body of data, for example the list of grocery items that you have been buying for the last six months, and it calculates the percentage of items being purchased together.
o Lift:
This measure compares the confidence of the rule with how often item B is purchased on its own; it shows how much more likely A and B are bought together than if they were independent.
o Support:
This measure shows how often the items are purchased together, compared with the overall data set (the fraction of all transactions that contain both items).
o Confidence:
This measure shows how often item B is purchased when item A is purchased as well.
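To make these three measures concrete, here is a small illustrative sketch; the transactions, item names, and the rule {bread} -> {butter} are invented for the example:

```python
# Compute support, confidence and lift for the rule A -> B on toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]

A, B = {"bread"}, {"butter"}
n = len(transactions)

support_A = sum(A <= t for t in transactions) / n          # P(A)
support_B = sum(B <= t for t in transactions) / n          # P(B)
support_AB = sum((A | B) <= t for t in transactions) / n   # P(A and B)

confidence = support_AB / support_A                        # P(B | A)
lift = confidence / support_B                              # how much A boosts B

print(f"support={support_AB:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift greater than 1 indicates that buying A makes buying B more likely than it would be by chance.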
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a
significant role in the data mining field. Outlier detection is valuable in numerous fields like network intrusion identification, credit or debit card fraud detection, detecting outlying values in wireless sensor network data, etc.
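A minimal sketch of one common outlier-detection approach, the interquartile-range (IQR) rule on a single attribute; the numbers and the 1.5 × IQR threshold are conventional illustrative choices, not taken from the text above:

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the interquartile-range rule).
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 95])   # 95 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)    # -> [95]
```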
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It consists of finding interesting subsequences in a set of sequences, where the value of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Data Mining is not very simple to understand and implement. It is already evident that data mining is a process that is crucial for many researchers and businesses, but its algorithms are complex and the data is not always readily available in one place. Every technology has flaws or issues, and one should always be aware of the various issues that data mining faces. The major issues are grouped below.
Mining Methodology and User Interaction Issues:
(i) Mining different kinds of knowledge in databases: This issue concerns covering a wide range of knowledge in order to meet the needs of the client or the customer. Because different users are interested in different kinds of knowledge, it is difficult for a single system to cover the whole range of knowledge discovery tasks.
(ii) Data mining query languages and ad hoc data mining: A data mining query language should give the user the ability to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.
(iii) Presentation and visualization of data mining results: The patterns or trends that are discovered need to be expressed in high-level languages and visual representations. The representation has to be chosen so that it is easily understood by everyone.
(iv) Handling noisy or incomplete data: Data cleaning methods are used for this purpose; they are a convenient way of handling the noise and incomplete objects in data mining. Without data cleaning methods, there will be no accuracy in the discovered patterns.
Performance Issues:
There are also several performance-related issues in data mining. These issues are listed as follows:
(i) Efficiency and scalability of data mining algorithms: Efficiency and scalability are very important for the data mining process. They ensure that the user can extract information from large amounts of data in various databases in an effective and productive manner.
(ii) Parallel, distributed and incremental mining algorithms: Several factors motivate the development of parallel and distributed algorithms in data mining: the large size of databases, the wide distribution of data, and the complexity of data mining methods. In this process, the algorithm first divides the data from the database into partitions; each partition is then processed in parallel; in the last step, the results from the partitions are merged, as sketched below.
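The partition / process-in-parallel / merge idea can be sketched with Python's standard multiprocessing module; the "mining" step here is just a frequency count, chosen only to keep the example small:

```python
# Partition the data, "mine" each partition in parallel, then merge the partial results.
from multiprocessing import Pool
from collections import Counter

def mine_partition(partition):
    # Stand-in mining task: count item frequencies in one partition.
    return Counter(partition)

if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a", "c", "c", "b", "a"]
    partitions = [data[0:4], data[4:7], data[7:10]]               # step 1: divide into partitions

    with Pool(processes=3) as pool:
        partial_results = pool.map(mine_partition, partitions)    # step 2: process in parallel

    merged = sum(partial_results, Counter())                      # step 3: merge partition results
    print(merged)
```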
Diverse Data Types Issues:
(i) Handling of relational and complex types of data: A database may contain various kinds of data objects, for example complex, multimedia, temporal, or spatial data objects. It is very difficult to mine all these kinds of data with a single system.
(ii) Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN, and the structures of these data sources differ as well, which makes integrated mining difficult.
What is Data?
A data object represents an entity. It is also called a sample, example, instance, data point, object, or tuple.
Data objects are described by attributes.
An attribute is a property or characteristic of a data object. Examples: the eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute.
Data Objects
Data objects are the essential part of a database. A data object represents an entity and is essentially a group of attribute values describing that entity. For example, in a sales database, data objects may represent customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. The set of attributes used to describe a given object is known as an attribute vector or feature vector.
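As a small illustration (assuming pandas is available; the customer attributes and values are invented), each row below is a data object (tuple) and each column is an attribute, so a row gives the object's attribute vector:

```python
# Data objects as rows (tuples) and attributes as columns.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 29, 45],
    "city": ["Pune", "Delhi", "Chennai"],
})

print(customers)                    # three data objects described by three attributes
print(customers.iloc[0].tolist())   # the attribute (feature) vector of the first object
```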
Types of attributes:
Qualitative Attributes:
1. Nominal Attributes (related to names): The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank, position) among the values of a nominal attribute.
Example: eye color (black, brown, blue), occupation (teacher, farmer, dentist).
2. Binary Attributes: A binary attribute has only two values or states, for example yes or no, affected or unaffected, true or false.
Symmetric: both values are equally important (e.g., gender).
Asymmetric: both values are not equally important (e.g., an exam result, where one outcome is of more interest than the other).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude of the difference between values is not known. The order of the values shows what is more important but does not indicate how much more important it is.
Quantitative Attributes:
Discrete: Discrete attributes have a finite or countably infinite set of values, which may be numerical or categorical in form.
Example: the number of students in a class, the number of rooms in a house, zip codes.
Data visualization
Data visualization is a set of data points and information represented graphically so that it is easy and quick for users to understand. A visualization is good if it has a clear meaning and purpose and is easy to interpret without requiring extra context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual elements such as charts, graphs, and maps.
Data visualization is very important in market research, where both numerical and categorical data can be visualized; this increases the impact of the insights and helps reduce the risk of analysis paralysis. Data visualization is categorized into the following categories:
1. Numerical Data:
Numerical data is also known as quantitative data. It is any data whose values generally represent amounts, such as the height, weight, or age of a person. Numerical data visualization is the easiest way to visualize data. It is generally used to help others digest large data sets and raw numbers in a way that makes them easier to turn into action. Numerical data is categorized into two categories:
Continuous Data –
Data that can take any value within a range (Example: height measurements).
Discrete Data –
Data that is not continuous (Example: the number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.
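A minimal matplotlib sketch of numerical data visualization using one of the chart types mentioned above; the sales figures are invented for illustration:

```python
# Bar chart of a small numerical data set (toy quarterly sales figures).
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]

plt.bar(quarters, sales)
plt.xlabel("Quarter")
plt.ylabel("Sales (units)")
plt.title("Quarterly sales")
plt.savefig("quarterly_sales.png")   # or plt.show() in an interactive session
```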
2. Categorical Data:
Categorical data is also known as qualitative data. It is any data whose values generally represent groups. It consists of categorical variables that are used to represent characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:
Binary Data –
In this, classification is based on positioning (Example: agrees or disagrees).
Nominal Data –
In this, classification is based on attributes (Example: male or female).
Ordinal Data –
In this, classification is based on the ordering of information (Example: timelines or processes).
The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
Measuring Similarity and Dissimilarity
Clustering consists of grouping objects that are similar to each other; it can be used to decide whether two items are similar or dissimilar in their properties.
In a data mining sense, the similarity measure is a distance whose dimensions describe the object features. If the distance between two data points is small, there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.
Most clustering approaches use distance measures to assess the similarity or difference between a pair of objects; the most popular distance measures used are:
1. Euclidean Distance:
This is the straight-line distance between two points. In a plane with P at coordinate (x1, y1) and Q at (x2, y2), the Euclidean distance is d(P, Q) = sqrt((x1 - x2)^2 + (y1 - y2)^2).
2. Manhattan Distance:
This determines the absolute difference between the pairs of coordinates. Suppose we have two points P and Q; to determine the distance between them, we simply sum the absolute differences of their coordinates along each axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2), the Manhattan distance is d(P, Q) = |x1 - x2| + |y1 - y2|.
Geometrically, the Manhattan distance is the total length of the path between the two points measured along axis-parallel (grid-line) segments.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of their intersection divided by the size of their union; the Jaccard distance is one minus the Jaccard index.
4. Minkowski Distance:
This is a generalization of the Euclidean and Manhattan distances. For points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn), d(P, Q) = (sum over i of |xi - yi|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
5. Cosine Index:
The cosine measure used for clustering is the cosine of the angle between two vectors, given by cos(theta) = (A · B) / (||A|| ||B||), where theta is the angle between the two vectors and A, B are n-dimensional vectors.
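The sketch below computes each of the above measures for two small example points and two example sets using numpy; the coordinates and sets are arbitrary illustrations:

```python
# Euclidean, Manhattan, Minkowski and cosine measures for two points,
# plus the Jaccard index for two sets.
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))
manhattan = np.sum(np.abs(p - q))
minkowski_3 = np.sum(np.abs(p - q) ** 3) ** (1 / 3)          # Minkowski with p = 3
cosine_similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

set_a, set_b = {1, 2, 3, 4}, {3, 4, 5}
jaccard_index = len(set_a & set_b) / len(set_a | set_b)

print(euclidean, manhattan, minkowski_3, cosine_similarity, jaccard_index)
```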
Data Preprocessing
Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation, as follows −
Data Cleaning − Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or eliminating outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining applied to it.
Moreover, dirty data can cause confusion in the mining phase, resulting in unreliable output. Although some mining routines have a phase for dealing with incomplete or noisy data, they are not always robust; instead, they may concentrate on avoiding overfitting the data to the function being modeled.
Data Integration − Data integration is the procedure of merging data from several disparate sources. While performing data integration, one must deal with data redundancy, inconsistency, duplication, etc. In data mining, data integration is a preprocessing method that merges data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data.
Data integration is especially important in the healthcare industry. Integrating patient data from multiple clinics and systems into a single view assists clinicians in recognizing medical disorders and diseases, and useful insights can be derived from this unified data.
Data Reduction − The objective of data reduction is to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data can be in terms of the number of rows (records) or the number of columns (dimensions).
In numerosity reduction, the data are replaced by an alternative, smaller representation using parametric models (such as regression or log-linear models) or nonparametric models (such as histograms, clusters, sampling, or data aggregation).
Data Transformation − In data transformation, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations. Data transformation includes −
Smoothing − This works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation − Here, summary or aggregation operations are applied to the data. For instance, daily sales data can be aggregated to compute monthly and annual totals. This procedure is typically used in constructing a data cube for analysis of the data at multiple granularities.
DATA CLEANING
Data cleaning is one of the important parts of machine learning. It plays a significant
part in building a model. It surely isn’t the fanciest part of machine learning and at the same
time, there aren’t any hidden tricks or secrets to uncover. However, the success or failure of a
project relies on proper data cleaning.
If we have a well-cleaned dataset, there is a good chance that we can achieve good results even with simple algorithms, which can prove very beneficial, especially in terms of computation when the dataset is large.
Obviously, different types of data will require different types of cleaning. However, the following systematic approach can always serve as a good starting point.
Removing duplicate or irrelevant observations:
This includes deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that do not actually fit the specific problem that you are trying to solve.
Redundant observations alter the efficiency to a great extent, as the repeated data may count towards the correct side or towards the incorrect side, thereby producing unreliable results.
Irrelevant observations are any type of data that is of no use to us and can be removed directly.
Fixing structural errors:
The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in the names of features, the same attribute appearing under different names, mislabeled classes (i.e. separate classes that should really be the same), or inconsistent capitalization.
For example, the model will treat 'america' and 'America' as different classes or values, though they represent the same value; similarly it may treat red, yellow, and red-yellow as three unrelated classes, though one class can be included in the other two. These structural errors make our model inefficient and give poor quality results.
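A minimal pandas sketch of this kind of cleanup (the column name 'country' and its values are invented): it normalizes inconsistent capitalization and then drops the resulting duplicates:

```python
# Fix inconsistent capitalization, then remove duplicate observations.
import pandas as pd

df = pd.DataFrame({"country": ["America", "america", "AMERICA", "India", "India"]})

df["country"] = df["country"].str.strip().str.lower()   # normalize case and whitespace
df = df.drop_duplicates()                               # remove duplicate rows

print(df)   # -> two distinct values: 'america' and 'india'
```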
Handling unwanted outliers:
Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes not, so one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of the real data.
Handling missing data:
Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or remove the missing observations; they must be handled carefully, as they can be an indication of something important. The two most common ways to deal with missing data are dropping the observations that have missing values and imputing the missing values based on other observations.
Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
Even if you build a model to impute your values, you’re not adding any real
information. You’re just reinforcing the patterns already provided by other features.
Missing data is like a missing puzzle piece. If you drop it, that is like pretending the puzzle slot is not there. If you impute it, that is like trying to squeeze in a piece from somewhere else in the puzzle.
So, missing data is always informative and an indication of something important, and we must make our algorithm aware of missing data by flagging it. By using this technique of flagging and filling, you essentially allow the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
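The flag-and-fill idea can be sketched with pandas (the column 'income' and its values are invented): an indicator column tells the algorithm that a value was missing, and the missing entries are then filled with a constant such as the column mean:

```python
# Flag missingness with an indicator column, then fill the missing values.
import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, None, 48000]})

df["income_missing"] = df["income"].isna().astype(int)    # 1 where the value was missing
df["income"] = df["income"].fillna(df["income"].mean())   # fill with the column mean

print(df)
```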
Some data cleansing tools:
OpenRefine
Trifacta Wrangler
TIBCO Clarity
Cloudingo
IBM InfoSphere QualityStage
DATA INTEGRATION
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
These sources may include multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between the queries of the source and global schemas.
There are two major approaches for data integration: the "tight coupling" approach and the "loose coupling" approach.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this coupling, data is combined from different sources into a single physical location through the process of ETL (Extraction, Transformation, and Loading).
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. The data remains only in the actual source databases.
There are three issues to consider during data integration: schema integration, redundancy detection, and resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
This involves integrating metadata (such as attribute names and data types) from different sources. The entity identification problem arises here: how can equivalent real-world entities from multiple data sources be matched up? For example, customer_id in one database and customer_number in another may refer to the same attribute.
2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another attribute or set
of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
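A minimal sketch of redundancy detection by correlation analysis (assuming pandas; the attribute names and values are invented): a correlation close to 1 suggests that one attribute can be derived from the other.

```python
# Detect a redundant attribute via correlation analysis.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],   # derivable from height_cm
    "weight_kg": [55, 62, 70, 78, 85],
})

corr = df.corr(method="pearson")
print(corr)   # height_cm and height_in correlate ~1.0, flagging redundancy
```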
DATA REDUCTION
The method of data reduction may achieve a condensed description of the original data
which is much smaller in quantity but keeps the quality of the original data.
Methods of data reduction:
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information you gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that the resulting data reports the total sales per year instead of per quarter.
2. Dimension reduction:
Whenever we come across data that is only weakly important, we use attribute subset selection to keep just the attributes required for our analysis. This reduces the data size as it eliminates outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the remaining original attributes to the set, based on their relevance (judged, for example, with a significance test such as a p-value in statistics).
Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute from the set. Suppose again that there are several attributes in the data set of which a few are redundant; the procedure removes the weakest ones one by one.
Combination of Forward and Backward Selection –
Combining both methods allows us to remove the worst attributes and select the best ones, saving time and making the process faster.
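As a hedged sketch of step-wise forward selection, the example below assumes scikit-learn's SequentialFeatureSelector is available and uses synthetic data generated only for illustration:

```python
# Step-wise forward selection of attributes using a linear model as the scorer.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 attributes, only a few of which are informative.
X, y = make_regression(n_samples=100, n_features=6, n_informative=3, random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X, y)

print("selected attribute indices:", selector.get_support(indices=True))
```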
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (such as Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used.
Lossless Compression –
Encoding techniques (such as run-length encoding) allow a simple and modest reduction of data size. Lossless data compression uses algorithms that restore the precise original data from the compressed data.
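A minimal run-length encoding sketch in Python, to make the lossless idea concrete; decoding restores exactly the original string:

```python
# Run-length encoding: a simple lossless compression scheme.
from itertools import groupby

def rle_encode(text):
    # e.g. "aaabbc" -> [("a", 3), ("b", 2), ("c", 1)]
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

original = "aaaabbbccd"
encoded = rle_encode(original)
assert rle_decode(encoded) == original    # lossless: the exact data is restored
print(encoded)
```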
Lossy Compression –
In lossy compression, the decompressed data may differ slightly from the original data, but it remains close enough to be useful. The JPEG image format is a common example of lossy compression.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so it is only necessary to store the model parameters rather than the actual data. Alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.
5. Discretization and Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals. We replace many constant values of the attributes with labels of small intervals. This means that mining results are presented in a concise and easily understandable way.
Top-down discretization –
If we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of the attribute, and then repeat this method on the resulting intervals, the process is known as top-down discretization, also called splitting.
Bottom-up discretization –
If we first consider all the constant values as split points and then discard some by merging neighbourhood values into intervals, that process is called bottom-up discretization, also called merging.
Concept Hierarchies:
These reduce the data size by collecting and then replacing low-level concepts (such as the value 43 for age) with higher-level concepts (categorical labels such as middle-aged or senior).
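A small pandas sketch of both ideas: continuous ages are discretized into intervals and then replaced by higher-level concept labels (the cut points and label names are illustrative choices, not prescribed by the text):

```python
# Discretize a continuous attribute and replace values with concept-hierarchy labels.
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 67, 80])

# Interval labels (discretization).
intervals = pd.cut(ages, bins=[0, 18, 40, 60, 100])

# Higher-level concepts (concept hierarchy).
concepts = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                  labels=["kid", "youth", "middle age", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```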
Histogram analysis –
Like the process of binning, a histogram is used to partition the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules, the most common being equal-width partitioning (each bucket range has the same width) and equal-frequency partitioning (each bucket contains roughly the same number of values).
Data Transformation
Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline for data
analytics projects. Organizations that use on-premises data warehouses generally use an ETL
(extract, transform, and load) process, in which data transformation is the middle step. Today,
most organizations use cloud-based data warehouses to scale compute and storage resources with
latency measured in seconds or minutes. The scalability of the cloud platform lets organizations
skip preload transformations and load raw data into the data warehouse, then transform it at
query time.
There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce any
variance or any other noise form.
The concept behind data smoothing is that it can identify simple changes to help predict different trends and patterns. This helps analysts or traders, who need to look at a lot of data that can often be difficult to digest, to find patterns they would not otherwise see.
We have seen how noise is removed from the data using techniques such as binning, regression, and clustering:
o Binning: This method splits the sorted data into a number of bins and smooths the data values in each bin by considering the neighborhood values around them.
o Regression: This method identifies the relation between two dependent attributes so that if we have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values and forms clusters. The values that lie outside a cluster are known as outliers.
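A minimal sketch of smoothing by bin means, using equal-frequency bins of size 3; the numbers are illustrative:

```python
# Smoothing noisy data by bin means (equal-frequency binning).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))

print(smoothed)   # each value replaced by the mean of its bin: 9, 22, 29
```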
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the given attributes and added to the data set to assist the mining process. This simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set containing measurements of different plots, i.e., we may have the height and width of each plot. So here, we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.
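A one-line pandas sketch of the plot example above (the column names and values are assumed for illustration):

```python
# Construct a new attribute 'area' from existing attributes 'height' and 'width'.
import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [5, 6, 7]})
plots["area"] = plots["height"] * plots["width"]
print(plots)
```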
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.
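A small pandas sketch of the sales example (the figures are invented): quarterly sales are aggregated into annual totals:

```python
# Aggregate quarterly sales into annual totals.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "revenue": [120, 150, 90, 180, 130, 160, 110, 200],
})

annual = sales.groupby("year")["revenue"].sum()
print(annual)   # one aggregated revenue figure per year
```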
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for
attribute A that are V1, V2, V3, ….Vn.
o Min-max normalization: This method maps a value v of attribute A to a value v' in the new range [new_min, new_max] using v' = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min.
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600. Using min-max normalization, $73,600 is transformed to (73,600 − 12,000) / (98,000 − 12,000) = 0.716.
o Z-score normalization: This method normalizes the value for attribute A using the mean and standard deviation. The formula used for z-score normalization is v' = (v − Ā) / σA, where Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation of attribute A are $54,000 and $16,000, and we have to normalize the value $73,600 using z-score normalization: (73,600 − 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the values of attribute A by moving the decimal point in the value. The movement of the decimal point depends on the maximum absolute value of A. The formula for decimal scaling is v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
For example, suppose the observed values of attribute A range from -986 to 917, so the maximum absolute value of A is 986. To normalize the values of A using decimal scaling, we divide each value of attribute A by 1000, i.e., j = 3. So the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters, such as the mean, standard deviation, and maximum absolute value, must be preserved so that future data can be normalized uniformly.
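The three methods above can be sketched in a few lines of numpy, reusing the example numbers where possible (the small income array itself is invented; the mean and standard deviation are the values stated in the z-score example):

```python
# Min-max, z-score and decimal-scaling normalization of the value 73,600.
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0].
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization (using the stated mean 54,000 and std. dev. 16,000).
z_score = (income - 54000.0) / 16000.0

# Decimal scaling: divide by 10^j so the largest absolute value is below 1 (j = 5 here).
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(min_max[2], z_score[2], decimal_scaled[2])   # values for 73,600
```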
5. Data Discretization
This is the process of converting continuous data into a set of data intervals. Continuous attribute values are substituted by small interval labels, which makes the data easier to study and analyze. If a data mining task handles a continuous attribute, its values can be replaced by discrete interval labels, which improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a set
of categorical data. Discretization also uses decision tree-based algorithms to produce short,
compact, and accurate results when using discrete values.
Data discretization can be classified in two ways: based on whether class information is used (supervised discretization uses class information, unsupervised discretization does not), and based on the direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10,
11-20…) or (kid, youth, adult, senior).