
DATA MINING – UNIT 1

UNIT I: DATA MINING BASICS


Objective: To understand the basics of Data Mining and Data

What is Data Mining – Kinds of Data – Kinds of Patterns – Technologies used for Data Mining – Major Issues in Data Mining – Data – Data Objects and Attribute Types – Data Visualization – Measuring Data Similarity and Dissimilarity – Data Preprocessing: Overview – Data Cleaning – Data Integration – Data Reduction – Data Transformation and Data Discretization.

What is Data Mining?

 The process of extracting information from huge sets of data to identify patterns, trends, and useful insights that allow a business to make data-driven decisions is called Data Mining.
 Data mining is the act of automatically searching large stores of information for trends and patterns that go beyond simple analysis procedures.
 Data mining uses complex mathematical algorithms to segment data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery from Data (KDD).
 Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.
 Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective.
 This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining.
 There is a huge amount of information available on various platforms, but very little usable knowledge. The biggest challenge is to analyze the data and extract the important information that can be used to solve a problem or to support company development.
 There are many powerful tools and techniques available to mine data and derive better insights from it.


Kinds of Data

Data mining can also be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
There are other kinds of data, such as semi-structured or unstructured data, including spatial data, multimedia data, text data, and web data, which require different data mining methodologies.

 Mining Multimedia Data: Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages. Multimedia data mining tries to find interesting patterns in multimedia databases. It includes processing digital data and performing tasks such as image processing, image classification, video and audio data mining, and pattern recognition.

 Mining Web Data: Web mining is essential to discover crucial patterns and knowledge from the Web. Web content mining analyzes the data of several websites, which includes the web pages and the multimedia data in them, such as images. Web mining is done to understand the content of web pages, the unique users of a website, unique hypertext links, web page relevance and ranking, web page content summaries, the time that users spend on a particular website, and user search patterns. Web mining can also be used to compare search engines and identify the most effective one.

 Mining Text Data: Text mining is a subfield of data mining, machine learning, natural language processing, and statistics. Most of the information in our daily life is stored as text, such as news articles, technical papers, books, email messages, and blogs. Text mining helps us retrieve high-quality information from text through tasks such as sentiment analysis, document summarization, text categorization, and text clustering. We apply machine learning models and NLP techniques to derive useful information from the text.

 Mining Spatiotemporal Data: Data that is related to both space and time is spatiotemporal data. Spatiotemporal data mining retrieves interesting patterns and knowledge from such data. It can help, for example, in estimating the value of land, determining the age of rocks and precious stones, and predicting weather patterns. Spatiotemporal data mining has many practical applications, such as GPS in mobile phones, timers, Internet-based map services, weather services, satellites, RFID tags, and sensor networks.

 Mining Data Streams: Stream data is data that arrives and changes dynamically; it is noisy and inconsistent and contains multidimensional features of different data types, so it is often stored in NoSQL database systems. The volume of stream data is very high, and this is the main challenge for effective mining of data streams. While mining data streams we need to perform tasks such as clustering, outlier analysis, and the online detection of rare events.

TECHNOLOGIES USED FOR DATA MINING

Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates both analysis and prediction.

In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction, sequential
patterns, and regression.


1. Classification:

This technique is used to obtain important and relevant information about data and metadata. It helps classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is based on the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks are comprehensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is based on the data analysis approach used, such as neural networks, machine learning, genetic algorithms, visualization, statistics, and data warehouse-oriented or database-oriented approaches. The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.

2. Clustering:

Clustering is the division of information into groups of connected objects; it models data by its clusters. Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an important role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, and medical diagnostics.

3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to predict the value of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it models the relationship between two or more variables in the given data set.

4. Association Rules:

This data mining technique helps discover links between two or more items. It finds hidden patterns in the data set.

Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or patterns in medical data sets.

The way the algorithm works is that you have a body of data, for example a list of grocery items that you have been buying for the last six months, and it calculates the percentage of items being purchased together.

There are three major measurement techniques; a short worked sketch in Python follows the list:

o Lift:
This measure compares the confidence of the rule with how often item B is purchased on its own.

Lift = (Confidence) / ( (Item B) / (Entire dataset) )

o Support:
This measure tells how often items A and B are purchased together, compared with the overall dataset.

Support = (Item A + Item B) / (Entire dataset)

o Confidence:
This measure tells how often item B is purchased when item A is purchased as well.

Confidence = (Item A + Item B) / (Item A)
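
A minimal Python sketch, assuming a toy list of market-basket transactions, of how these three measures can be computed for a rule A -> B:

# Toy transactions; the items and counts are assumed purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # How often the consequent appears in transactions that contain the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # Confidence of the rule divided by the support of the consequent alone.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

A, B = {"bread"}, {"milk"}
print(support(A | B, transactions))    # 0.4  -> support of {bread, milk}
print(confidence(A, B, transactions))  # about 0.67 -> P(milk | bread)
print(lift(A, B, transactions))        # about 0.83 on this toy data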

5. Outlier detection:

This data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. It can be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor network data.

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It involves finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

Major Issues in Data Mining

Data mining is not very simple to understand and implement. Although data mining is a process that is very valuable for researchers and businesses, its algorithms are complex and the data is often not readily available in one place. Every technology has flaws or issues, and one needs to be aware of the various flaws and issues that this technology has.

Mining Methodology and User Interaction Issues:

(i) Mining different kinds of knowledge in databases: This issue concerns covering a wide range of knowledge in order to meet the needs of the client or the customer. Because different users are interested in different kinds of information, it is difficult for one system to cover the whole range of knowledge discovery tasks.

(ii) Interactive mining of knowledge at multiple levels of abstraction: Interactive mining is very important because it permits the user to focus the search for patterns, providing and refining data mining requests based on the results that are returned. In simpler words, it allows the user to examine the discovered patterns from various different angles.

(iii) Incorporation of background knowledge: The main role of background knowledge is to guide the discovery process and to help interpret the patterns or trends that are found. Background knowledge can also be used to express discovered patterns in brief and precise terms, and it can be represented at different levels of abstraction.

(iv) Data mining query languages and ad hoc data mining: A data mining query language should give the user the ability to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.

(v) Presentation and visualization of data mining results: The patterns or trends that are discovered need to be expressed in high-level languages and visual representations. The representation has to be chosen so that it is easily understood by everyone.

(vi) Handling noisy or incomplete data: Data cleaning methods are used for this purpose; they are a convenient way of handling noise and incomplete objects in data mining. Without data cleaning methods, the discovered patterns will lack accuracy and will be poor in quality.

Performance Issues:

Performance-related issues arise in data mining as well. These issues are listed as follows:

(i) Efficiency and scalability of data mining algorithms: Efficiency and scalability are very important for the data mining process, because they allow the user to extract information from large amounts of data in various databases in an effective and productive manner.

(ii) Parallel, distributed, and incremental mining algorithms: Several factors motivate the development of parallel and distributed algorithms in data mining: the large size of databases, the wide distribution of data, and the complexity of data mining methods. In such algorithms, the data is first divided into partitions, the partitions are then processed in parallel, and finally the results from the partitions are merged.

Diverse Data Types Issues:


The issues of this type are given below:

(i) Handling of relational and complex types of data: A database may contain various data objects, for example complex, multimedia, temporal, or spatial data objects. It is very difficult to mine all of these kinds of data with a single system.

(ii) Mining information from heterogeneous databases and global information systems: The problem here is to mine knowledge from various data sources. These data are not available at a single source; instead they reside at different data sources on a LAN or WAN, and their structures differ as well.

What is Data ?

Data sets are made up of data objects.

 A data object represents an entity. It is also called a sample, example, instance, data point, object, or tuple.
 Data objects are described by attributes.
 An attribute is a property or characteristic of a data object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
 A collection of attributes describes an object.
 Attribute values are numbers or symbols assigned to an attribute.

Data Objects

Data objects are an essential part of a database. A data object represents an entity and can be viewed as a group of attribute values describing that entity. For example, in a sales database, data objects may represent customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.


Attribute:

An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.

Type of attributes:

Identifying attribute types is the first step of data preprocessing: we distinguish between the different types of attributes and then preprocess the data accordingly. Attribute types fall into two broad groups:

1. Qualitative (Nominal (N), Ordinal (O), Binary (B))

2. Quantitative (Numeric: Discrete, Continuous)

Qualitative Attributes:

1. Nominal Attributes (related to names): The values of a nominal attribute are names of things or symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank, position) among the values of a nominal attribute.

Example :


2. Binary Attributes: A binary attribute has only 2 values/states, for example yes or no, affected or unaffected, true or false.
 Symmetric: both values are equally important (e.g., gender).
 Asymmetric: the two values are not equally important (e.g., a test result).

3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude of the difference between values is not actually known; the order of the values shows what is important but does not indicate how important it is.


Quantitative Attributes:

1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of 2 types: interval-scaled and ratio-scaled.
 An interval-scaled attribute has values whose differences are interpretable, but it does not have a true reference point, i.e., a true zero point. Interval-scaled data can be added and subtracted, but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.
 A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the differences between values, and the mean, median, mode, quantile range, and five-number summary can be computed.

2. Discrete: Discrete attributes have a finite or countably infinite set of values; they can be numerical or categorical in form.

Example:


3. Continuous: Continuous data have an infinite number of possible values and are typically represented as floating-point numbers. There can be infinitely many values between 2 and 3.

Example
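
A minimal Python sketch, using an assumed toy record, that tags each field with the attribute type described above:

# Toy customer record; the field names and values are assumptions for illustration.
customer = {
    "customer_id": "C-1041",   # nominal: a name/symbol with no order
    "loan_approved": True,     # binary: only two states
    "satisfaction": "medium",  # ordinal: low < medium < high, magnitudes unknown
    "temperature_c": 21.5,     # interval-scaled numeric: no true zero point
    "income": 54000.0,         # ratio-scaled numeric: true zero, ratios meaningful
    "num_children": 2,         # discrete: countable values
    "height_m": 1.72,          # continuous: any value within a range
}

# The ordinal attribute only supports ranking, not arithmetic on the gaps.
satisfaction_order = ["low", "medium", "high"]
print(satisfaction_order.index(customer["satisfaction"]))  # 1, i.e. the middle rank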

Data visualization

Data visualization is a set of data points and information represented graphically so that it is easy and quick for users to understand. A data visualization is good if it has a clear meaning and purpose and is very easy to interpret without requiring context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual elements such as charts, graphs, and maps.

Characteristics of an Effective Graphical Visualization:

 It shows or visualizes data very clearly in an understandable manner.
 It encourages viewers to compare different pieces of data.
 It closely integrates statistical and verbal descriptions of the data set.
 It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human brain tends to focus on visual data more than on written data.
 It also helps in identifying areas that need more attention and improvement.
 Using graphical representation, a story can be told more efficiently; it also takes less time to understand a picture than to understand textual data.

Categories of Data Visualization:

Data visualization is very important in market research, where both numerical and categorical data can be visualized, which increases the impact of insights and helps reduce the risk of analysis paralysis. Data visualization is categorized into the following categories:


Figure – Categories of Data Visualization

1. Numerical Data:

Numerical data is also known as quantitative data. Numerical data is any data where the values represent amounts, such as the height, weight, or age of a person. Numerical data visualization is the easiest way to visualize data, and it is generally used to help others digest large data sets and raw numbers in a way that makes them easier to act on. Numerical data is categorized into two categories:

 Continuous Data –
Data that can take any value within a range (example: height measurements).

 Discrete Data –
Data that is not continuous (example: the number of cars or children a household has).

The types of visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.
2. Categorical Data:

Categorical data is also known as qualitative data. Categorical data is any data where the values represent groups. It consists of categorical variables that are used to represent characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:

 Binary Data –
Classification is based on positioning (example: agrees or disagrees).
 Nominal Data –
Classification is based on attributes (example: male or female).
 Ordinal Data –
Classification is based on the ordering of information (example: timelines or processes).

The types of visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.

Measuring Data Similarity And Dissimilarity In Data Mining

Clustering consists of grouping objects that are similar to each other, and similarity measures can be used to decide whether two items are similar or dissimilar in their properties.
In a data mining sense, a similarity measure is a distance over the dimensions describing the objects' features: if the distance between two data points is small, there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.

Most clustering approaches use distance measures to assess the similarity or difference between a pair of objects. The most popular distance measures are:

1. Euclidean Distance:

Euclidean distance is considered the traditional metric for geometric problems. It can be simply explained as the ordinary straight-line distance between two points, and it is one of the most widely used measures in cluster analysis; the K-means algorithm, for example, uses it. Mathematically, it is the square root of the sum of the squared differences between the coordinates of the two objects:

Euclidean distance between P(x1, y1) and Q(x2, y2) = sqrt((x1 – x2)^2 + (y1 – y2)^2)


Figure – Euclidean Distance

2. Manhattan Distance:

Manhattan distance is the sum of the absolute differences between the pairs of coordinates.

Suppose we have two points P and Q; to determine the distance between them we simply add up the distances between the points along the X-axis and the Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|


In the corresponding figure, the total length of the red path gives the Manhattan distance between the two points.

3. Jaccard Index:

The Jaccard index measures the similarity of two sets of items as the size of their intersection divided by the size of their union; the Jaccard distance is one minus the Jaccard index.

Jaccard index J(A, B) = |A ∩ B| / |A ∪ B|

Figure – Jaccard Index


4. Minkowski distance:

It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as

(x1, x2, ..., xN)

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then the Minkowski distance of order p between P1 and P2 is given as:

( |X1 – Y1|^p + |X2 – Y2|^p + ... + |XN – YN|^p )^(1/p)

 When p = 2, the Minkowski distance is the same as the Euclidean distance.
 When p = 1, the Minkowski distance is the same as the Manhattan distance.

5. Cosine Index:

The cosine measure used in clustering determines the cosine of the angle between two vectors, given by the following formula:

cos(θ) = (A · B) / (||A|| × ||B||)

Here θ (theta) gives the angle between the two vectors, and A, B are n-dimensional vectors.

Figure – Cosine Distance
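
A minimal Python sketch of the measures above, using assumed toy points and sets:

import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    # Generalized form: r = 2 gives Euclidean, r = 1 gives Manhattan.
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def jaccard_index(s1, s2):
    # Size of the intersection divided by the size of the union.
    return len(s1 & s2) / len(s1 | s2)

def cosine_similarity(p, q):
    # Cosine of the angle between the vectors; cosine distance = 1 - similarity.
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

P, Q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(P, Q))                        # 5.0
print(manhattan(P, Q))                        # 7.0
print(minkowski(P, Q, 3))                     # about 4.5
print(jaccard_index({"a", "b"}, {"b", "c"}))  # about 0.33
print(cosine_similarity(P, Q))                # about 0.99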


Data Preprocessing

Preprocessing in Data Mining:

Data preprocessing is a data mining step used to transform raw data into a useful and efficient format.

The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation, as follows −

Data Cleaning − Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data is dirty, they are unlikely to trust the results of any data mining applied to it.

Moreover, dirty data can confuse the mining procedure, resulting in unreliable output. Although some mining routines have steps for dealing with incomplete or noisy data, they are not always robust; instead, they may concentrate on avoiding overfitting the data to the function being modeled.

Data Integration − Data integration is the procedure of merging data from several disparate sources. While performing data integration, issues such as data redundancy, inconsistency, and duplication must be handled. In data mining, data integration is a preprocessing step that merges data from multiple heterogeneous data sources into a coherent store and provides a unified view of the data.

Data integration is especially important in the healthcare industry. Integrating data from multiple patient records and clinics assists clinicians in recognizing medical disorders and diseases by combining data from multiple systems into a single view from which useful insights can be derived.

Data Reduction − The objective of data reduction is to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data can be in terms of the number of rows (records) or the number of columns (dimensions).

In dimensionality reduction, data encoding schemes are used to obtain a reduced or “compressed” representation of the original data. Examples include data compression methods (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more useful attributes is derived from the initial set).

In numerosity reduction, the data are replaced by an alternative, smaller representation using parametric models such as regression or log-linear models, or nonparametric models such as histograms, clusters, sampling, or data aggregation.

Data Transformation − In data transformation, data are transformed or consolidated into forms appropriate for mining, for example by executing summary or aggregation operations. Data transformation includes −

Smoothing − This works to remove noise from the data. Such techniques include binning, regression, and clustering.

Aggregation − In aggregation, summary or aggregation operations are applied to the data. For instance, daily sales data can be aggregated to compute monthly and annual totals. This procedure is generally used in building a data cube for analyzing the data at multiple granularities.

DATA CLEANING

Data cleaning is one of the important parts of machine learning. It plays a significant part in building a model. It surely isn't the fanciest part of machine learning, and there aren't any hidden tricks or secrets to uncover; however, the success or failure of a project relies on proper data cleaning.
If we have a well-cleaned dataset, there is a good chance that we can achieve good results even with simple algorithms, which can prove very beneficial, especially in terms of computation when the dataset is large.
Obviously, different types of data will require different types of cleaning. However, the systematic approach below can always serve as a good starting point.


Steps involved in Data Cleaning:

1. Removal of unwanted observations

This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don't actually fit the specific problem that you're trying to solve.
 Redundant observations reduce efficiency to a great extent, because the repeated data may add weight to the correct side or to the incorrect side, thereby producing unreliable results.
 Irrelevant observations are any type of data that is of no use to us and can be removed directly.

2. Fixing Structural errors

The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in the names of features, the same attribute appearing under different names, mislabeled classes (i.e., separate classes that should really be the same), and inconsistent capitalization.
 For example, the model will treat "America" and "america" as different classes or values even though they represent the same value, or will treat red, yellow, and red-yellow as different classes or attributes, even though one class can be included in the other two. These structural errors make our model inefficient and give poor-quality results; a small clean-up sketch follows.
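
A minimal Python sketch, assuming a toy list of raw class labels, of cleaning up such structural errors:

# Raw labels with inconsistent capitalization and stray whitespace (assumed values).
raw_labels = ["America", "america ", " AMERICA", "red", "Red ", "yellow"]

# Map every label to a single canonical form: trimmed and lower-cased.
cleaned = [label.strip().lower() for label in raw_labels]

# Optional alias table for labels spelled differently but meaning the same thing.
aliases = {"usa": "america"}
cleaned = [aliases.get(label, label) for label in cleaned]

print(sorted(set(cleaned)))  # ['america', 'red', 'yellow']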

3. Managing Unwanted outliers

Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes not. So one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of the real data.
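
A minimal Python sketch, assuming a toy numeric sample, that flags suspicious values with the common 1.5 x IQR rule; whether to actually drop them remains a judgement call:

import statistics

values = [12, 13, 12, 14, 15, 13, 14, 120]       # 120 looks suspicious

q1, _q2, q3 = statistics.quantiles(values, n=4)  # quartiles (Python 3.8+)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [120]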

4. Handling missing data

Missing data is a deceptively tricky issue in machine learning. We cannot just ignore or remove the missing observations; they must be handled carefully, as they can be an indication of something important. The two most common ways to deal with missing data are:

1. Dropping observations with missing values.


 The fact that the value was missing may be informative in itself.
 Plus, in the real world, you often need to make predictions on new data even if some
of the features are missing!
2. Imputing the missing values from past observations.


 Again, "missingness" is almost always informative in itself, and you should tell your algorithm if a value was missing.
 Even if you build a model to impute your values, you're not adding any real information; you're just reinforcing the patterns already provided by other features.

Missing data is like a missing puzzle piece: if you drop the observation, that's like pretending the puzzle slot isn't there; if you impute it, that's like trying to squeeze in a piece from somewhere else in the puzzle.
So missing data is almost always informative and an indication of something important, and we must make our algorithm aware of missing data by flagging it. By using this technique of flagging and filling, you essentially allow the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
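
A minimal Python sketch, assuming a toy column with missing ages, of the flag-and-fill approach described above:

ages = [25, None, 41, 37, None, 52]   # assumed values; None marks a missing entry

# 1. Keep an indicator so the algorithm is told the value was missing.
age_missing = [1 if a is None else 0 for a in ages]

# 2. Fill the gaps with a simple constant (here, the mean of the observed values).
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
ages_filled = [mean_age if a is None else a for a in ages]

print(age_missing)   # [0, 1, 0, 0, 1, 0]
print(ages_filled)   # missing entries replaced by 38.75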
Some data cleansing tools:

 OpenRefine
 Trifacta Wrangler
 TIBCO Clarity
 Cloudingo
 IBM InfoSphere QualityStage

DATA INTEGRATION

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

The data integration approach is formally defined as a triple <G, S, M>, where:

G stands for the global schema,

S stands for the heterogeneous source schemas,

M stands for the mappings between the queries of the source and global schemas.


There are mainly 2 major approaches for data integration – one is the “tight coupling
approach” and another is the “loose coupling approach”.

Tight Coupling:

 Here, a data warehouse is treated as an information retrieval component.


 In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling:

 Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
 And the data only remains in the actual source databases.

Issues in Data Integration:

There are three issues to consider during data integration: schema integration, redundancy detection, and resolution of data value conflicts. These are explained briefly below.

1. Schema Integration:

 Integrate metadata from different sources.
 Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.


2. Redundancy:

 An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
 Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis, as sketched below.
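
A minimal Python sketch, with assumed attribute values, of flagging a possibly redundant attribute through Pearson correlation (statistics.correlation requires Python 3.10+):

import statistics

height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]   # nearly the same information, different unit

r = statistics.correlation(height_cm, height_in)   # Pearson's r
if abs(r) > 0.95:
    print(f"r = {r:.3f}: the attributes look redundant; one of them could be dropped")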

3. Detection and resolution of data value conflicts:

 This is the third critical issue in data integration.


 Attribute values from different sources may differ for the same real-world entity.
 An attribute in one system may be recorded at a lower level of abstraction than the “same”
attribute in another.

DATA REDUCTION

Data reduction methods aim to obtain a condensed representation of the original data that is much smaller in volume but preserves the quality of the original data.
Methods of data reduction:

These are explained below.

1. Data Cube Aggregation:

This technique is used to aggregate data into a simpler form. For example, imagine that the information you gathered for your analysis for the years 2012 to 2014 includes your company's revenue every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that the resulting data reports total sales per year instead of per quarter; a short sketch follows.
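
A minimal Python sketch, with assumed quarterly figures, of aggregating quarterly revenue into annual totals:

# Quarterly sales (assumed figures), keyed by (year, quarter).
quarterly_sales = {
    (2012, "Q1"): 224_000, (2012, "Q2"): 408_000,
    (2012, "Q3"): 350_000, (2012, "Q4"): 586_000,
    (2013, "Q1"): 230_000, (2013, "Q2"): 412_000,
    (2013, "Q3"): 360_000, (2013, "Q4"): 610_000,
}

annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)  # {2012: 1568000, 2013: 1612000}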

2. Dimension reduction:

Whenever we come across attributes that are only weakly relevant, we keep only the attributes required for our analysis. Dimension reduction reduces data size by eliminating outdated or redundant features.

 Step-wise Forward Selection –

 The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set, based on its relevance (assessed, for example, with a statistical significance test such as a p-value).


Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Step-wise Backward Selection –

 This selection starts with the complete set of attributes in the original data and at each step eliminates the worst remaining attribute from the set.
 Suppose there are the following attributes in the data set, of which a few are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Combination of Forward and Backward Selection –

It allows us to remove the worst attributes and select the best ones at each step, saving time and making the process faster. A small sketch of forward selection follows.
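
A minimal Python sketch of step-wise forward selection, with assumed toy data and a simple correlation-with-target score standing in for the relevance test (statistics.correlation requires Python 3.10+):

import statistics

target = [1.0, 2.0, 3.0, 4.0, 5.0]          # assumed values to be predicted
attributes = {
    "X1": [1.1, 2.0, 2.9, 4.2, 5.1],        # strongly related to the target
    "X2": [5.0, 4.0, 3.0, 2.0, 1.0],        # perfectly (negatively) related
    "X3": [2.0, 2.0, 2.0, 2.1, 2.0],        # nearly constant, weakly relevant
}

def relevance(values):
    # Absolute Pearson correlation with the target as a crude relevance score.
    return abs(statistics.correlation(values, target))

selected, remaining = [], set(attributes)
for _ in range(2):                           # keep the two most relevant attributes
    best = max(remaining, key=lambda name: relevance(attributes[name]))
    selected.append(best)
    remaining.remove(best)

print(selected)  # ['X2', 'X1'] on this toy data; X3 is never selected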

3. Data Compression:

The data compression technique reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used.

 Lossless Compression –

Encoding techniques (such as run-length encoding) allow a simple and modest reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.

 Lossy Compression –

Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this type of compression. For example, the JPEG image format uses lossy compression, but we can still find meaning equivalent to the original image. In lossy data compression, the decompressed data may differ from the original data but are still useful enough to retrieve information from.

4. Numerosity Reduction:

In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so it is only necessary to store the model parameters. Alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.

5. Discretization & Concept Hierarchy Operation:

Data discretization techniques are used to divide attributes of a continuous nature into data with intervals. We replace many constant values of the attributes with labels of small intervals. This means that mining results are presented in a concise and easily understandable way.

 Top-down discretization –

If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values and then repeat this method on the resulting intervals until the end, the process is known as top-down discretization, also known as splitting.

 Bottom-up discretization –

If you first consider all the constant values as split points and then discard some of them by merging neighbouring values into intervals, the process is called bottom-up discretization, also known as merging.

Concept Hierarchies:

A concept hierarchy reduces the data size by collecting and replacing low-level concepts (such as the value 43 for age) with higher-level concepts (categorical values such as middle-aged or senior).

For numeric data, the following techniques can be used:


 Binning –
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.


 Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules (a short sketch of the first two follows the list):

1. Equal-frequency partitioning: partitioning the values so that each bucket holds roughly the same number of occurrences from the data set.
2. Equal-width partitioning: partitioning the values into intervals of fixed width determined by the number of bins, e.g., buckets covering ranges of 0-20 each.
3. Clustering: grouping similar data together.
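
A minimal Python sketch, with assumed attribute values, of equal-width and equal-frequency partitioning:

values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # assumed values
num_bins = 3

# Equal-width: every bucket spans the same range of values.
lo, hi = min(values), max(values)
width = (hi - lo) / num_bins
equal_width = [[v for v in values if lo + i * width <= v < lo + (i + 1) * width]
               for i in range(num_bins)]
equal_width[-1].append(hi)          # put the right edge into the last bucket

# Equal-frequency: every bucket holds (roughly) the same number of values.
size = len(values) // num_bins      # assumes the count divides evenly
equal_frequency = [values[i * size:(i + 1) * size] for i in range(num_bins)]

print(equal_width)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_frequency)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]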

Data Transformation

Data transformation is an essential data preprocessing technique that must be performed


on the data before data mining to provide patterns that are easier to understand.

Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline for data
analytics projects. Organizations that use on-premises data warehouses generally use an ETL
(extract, transform, and load) process, in which data transformation is the middle step. Today,
most organizations use cloud-based data warehouses to scale compute and storage resources with
latency measured in seconds or minutes. The scalability of the cloud platform lets organizations
skip preload transformations and load raw data into the data warehouse, then transform it at
query time.

Data Transformation Techniques

There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.


1. Data Smoothing

Data smoothing is a process used to remove noise from the dataset using certain algorithms. It highlights the important features present in the dataset and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce any variance or other forms of noise.

The concept behind data smoothing is that it can identify simple changes and thereby help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.

We have seen how noise is removed from the data using techniques such as binning, regression, and clustering; a small smoothing sketch follows the list below.

o Binning: This method splits the sorted data into a number of bins and smooths the data values in each bin by considering the neighbourhood of values around them.

o Regression: This method identifies the relation between two dependent attributes so that if we have one attribute, it can be used to predict the other.

o Clustering: This method groups similar data values to form clusters. The values that lie outside the clusters are known as outliers.
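
A minimal Python sketch, with assumed sorted price values, of smoothing by bin means, where every value is replaced by the mean of its bin:

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # assumed values
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean)] * len(bin_values))

print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]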

2. Attribute Construction

In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. New attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots, i.e., we may have the height and width of each plot. Here we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.
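
A minimal Python sketch, with assumed plot measurements, of constructing the new 'area' attribute:

plots = [
    {"plot_id": 1, "height": 10.0, "width": 4.0},   # assumed measurements
    {"plot_id": 2, "height": 12.5, "width": 6.0},
]

for plot in plots:
    plot["area"] = plot["height"] * plot["width"]   # derived attribute

print(plots)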

3. Data Aggregation

Data aggregation is the method of collecting, storing, and presenting data in a summary format. The data may be obtained from multiple data sources that are integrated for data analysis. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and in a large enough quantity is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning financing or business strategy to pricing, operations, and marketing strategies.

For example, if we have a data set of sales reports of an enterprise with quarterly sales for each year, we can aggregate the data to get the enterprise's annual sales report.

4. Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range, such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below; a combined sketch follows the three methods.

Consider that we have a numeric attribute A and n observed values for attribute A: V1, V2, V3, ..., Vn.


o Min-max normalization: This method performs a linear transformation on the original data. Let minA and maxA be the minimum and maximum values observed for attribute A, and let Vi be a value of attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA]. The formula for min-max normalization is:

V'i = ((Vi – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600. Using min-max normalization, $73,600 is transformed to ((73,600 – 12,000) / (98,000 – 12,000)) × (1.0 – 0.0) + 0.0 = 0.716.

o Z-score normalization: This method normalizes the values of attribute A using the mean and standard deviation. The following formula is used for z-score normalization:

V'i = (Vi – Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively. For example, if the mean and standard deviation for attribute A are $54,000 and $16,000, the value $73,600 is normalized by z-score to (73,600 – 54,000) / 16,000 = 1.225.

o Decimal scaling: This method normalizes the values of attribute A by moving the decimal point. The number of places moved depends on the maximum absolute value of A. The formula for decimal scaling is:

V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1.

For example, suppose the observed values for attribute A range from -986 to 917, so the maximum absolute value of A is 986. To normalize each value of attribute A using decimal scaling, we divide each value by 1000, i.e., j = 3. So the value -986 would be normalized to -0.986 and 917 to 0.917.
Normalization parameters such as the mean, standard deviation, and maximum absolute value must be preserved so that future data can be normalized uniformly.
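
A minimal Python sketch of the three normalization methods, using the example figures above:

def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    # Linear transformation of v into the range [new_min, new_max].
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Number of standard deviations that v lies away from the mean.
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # Divide by the smallest power of ten that brings the largest |value| below 1.
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))  # about 0.716
print(z_score(73_600, 54_000, 16_000))  # 1.225
print(decimal_scaling(-986, 986))       # -0.986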

5. Data Discretization

This is the process of converting continuous data into a set of data intervals. Continuous attribute values are substituted by small interval labels. This makes the data easier to study and analyze. If a data mining task handles a continuous attribute, its continuous values can be replaced by a small number of interval labels, which improves the efficiency of the task.

This method is also called a data reduction mechanism, as it transforms a large dataset into a set of categorical data. Discretization also allows decision tree-based algorithms to produce short, compact, and accurate results when using discrete values.

Data discretization can be classified into two types: supervised discretization, where the class information is used, and unsupervised discretization, where it is not. Discretization can also be characterized by the direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.

For example, the values for the age attribute can be replaced by the interval labels such as (0-10,
11-20…) or (kid, youth, adult, senior).
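
A minimal Python sketch, with assumed ages and assumed interval boundaries, of replacing a continuous age attribute with such labels:

def age_label(age):
    # The boundaries below are assumptions chosen purely for illustration.
    if age <= 12:
        return "kid"
    if age <= 24:
        return "youth"
    if age <= 60:
        return "adult"
    return "senior"

ages = [4, 17, 35, 43, 71]
print([age_label(a) for a in ages])  # ['kid', 'youth', 'adult', 'adult', 'senior']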
