
SRI VIDYA MANDIR ARTS AND SCIENCE COLLEGE

(AUTONOMOUS)

DEPARTMENT OF COMPUTER SCIENCE AND APPLICATIONS

DATA MINING

Name: _________________________________________

Class: _________________________________________
Syllabus
UNIT – I
Introduction: Data Mining Tasks – Data Mining Versus Knowledge Discovery in Databases –
Relational Databases – Data Warehouses – Transactional Databases – Object Oriented
Databases – Spatial Databases – Temporal Databases – Text and Multimedia Databases –
Heterogeneous Databases – Mining Issues – Metrics – Social Implications of Data Mining.
UNIT – II
Data Preprocessing: Why Preprocess the Data – Data Cleaning – Data Integration – Data
Transformation – Data Reduction – Data Discretization.
UNIT – III
Data Mining Techniques: Association Rule Mining – The Apriori Algorithm – Multilevel
Association Rules – Multidimensional Association Rules – Constraint Based Association
Mining.
UNIT – IV
Classification and Prediction: Issues Regarding Classification and Prediction – Decision Tree
Induction – Bayesian Classification – Back Propagation – Classification Methods –
Prediction – Classifiers Accuracy.
UNIT – V
Clustering Techniques: Cluster Analysis – Clustering Methods – Hierarchical Methods –
Density Based Methods – Outlier Analysis – Introduction to Advanced Topics: Web Mining,
Spatial Mining and Temporal Mining.
Text Book
1. J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan
Kaufmann, New Delhi, 2001.
Reference Books
1. M.H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson
Education, Delhi, 2003.
2. Paulraj Ponnaiah, "Data Warehousing Fundamentals", Wiley Publishers, 2001.
3. S.N. Sivananda and S. Sumathi, "Data Mining", Thomson Learning, Chennai, 2006.

1. Introduction

DATA MINING: Knowledge mining from databases

Data mining is the process of automatically discovering interesting patterns and knowledge from large amounts of data.

Data mining can also be defined as the process of extracting important or relevant information from a set of raw data.

Data mining is also referred to as KDD (Knowledge Discovery in Databases).

Formal Definition

Data mining is a collection of techniques for the efficient, automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases.

The patterns must be actionable so that they may be used in an enterprise's decision-making process.

1.1 DATA MINING TASKS


1.1.1 Definition
Data mining tasks are the kind of data patterns that can be mined.
Data mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks.
1.1.2 Classifications of Data Mining Tasks

Figure 1.1 Data Mining Tasks


In general, data mining tasks can be classified into two categories


 Descriptive
 Predictive
Descriptive data mining
Descriptive mining tasks characterize the general properties of the data in a target data set.
Descriptive data mining brings out the common characteristics of the data. It offers knowledge of the data and gives insight into what is going on inside the data without any prior assumptions.

Predictive Data Mining

Predictive mining tasks perform induction (inference) on the current data in order to make predictions.
Predictive data mining provides predictions derived from the data to its users.
1.1.3 Key Data Mining Tasks
Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts.
It can be useful to describe individual classes and concepts in summarized, concise,
and yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived in the following two ways:
Data characterization is a summarization of the general characteristics or features of
a target class of data. The data corresponding to the user-specified class are typically
collected by a query.
Data discrimination is a comparison of the general features of the target class with those of one or more contrasting (predefined) classes.
To predict inaccessible or missing numeric values in the data, regression analysis is used; if the class label is absent, classification is used to make the prediction.
Mining Frequent Patterns
One of the functions of data mining is finding data patterns.
Frequent patterns are patterns that occur frequently in transactional data.
There are many kinds of frequent patterns
 Frequent itemsets


 Frequent subsequences (also known as sequential patterns)


 Frequent substructures
Frequent item set
A frequent itemset refers to a set of items that frequently appear together in a
transactional data set.
For example, milk and bread are frequently bought together in grocery stores by many customers.
Frequent Subsequence
A frequent subsequence is a pattern in which events occur frequently in a particular order, such as purchasing a camera followed by a memory card.

Frequent substructure
A substructure refers to various types of structural forms, such as trees and graphs, that may be combined with itemsets or subsequences.
If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
Association Analysis
It analyses the set of items that generally occur together in a transactional dataset. It is
also known as Market Basket Analysis for its wide use in retail sales.
Two parameters are used for determining the association rules:
Support identifies the frequently occurring (common) itemsets in the database.
Confidence is the conditional probability that an item occurs in a transaction, given that another item occurs in it.
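As a rough illustration (not part of the original notes), the short Python sketch below computes support and confidence for the rule bread => milk over a small, made-up list of transactions; the item names and the four transactions are assumptions used only for the example.

# Minimal sketch: support and confidence for an association rule.
# The transactions below are hypothetical example data.
transactions = [
    {"bread", "milk"},
    {"bread", "egg"},
    {"bread", "milk", "juice"},
    {"egg", "juice"},
]

antecedent = {"bread"}
consequent = {"milk"}

n = len(transactions)
count_both = sum(1 for t in transactions if antecedent | consequent <= t)
count_antecedent = sum(1 for t in transactions if antecedent <= t)

support = count_both / n                    # fraction of transactions containing both itemsets
confidence = count_both / count_antecedent  # conditional probability of milk given bread

print(f"support = {support:.2f}, confidence = {confidence:.2f}")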
Correlation Analysis
Correlation is a mathematical technique that can show whether and how strongly the
pairs of attributes are related to each other.
For example, taller people tend to have more weight.
It is a kind of additional analysis performed to uncover interesting statistical correlations between two itemsets, to analyze whether they have a positive, negative or no effect on each other.
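As a small illustrative sketch (the height and weight values below are invented for the example), Pearson's correlation coefficient can be computed as follows; a value close to +1 indicates a strong positive correlation, close to -1 a strong negative one, and near 0 no linear correlation.

# Minimal sketch: Pearson correlation between height and weight.
import math

height = [150, 160, 165, 170, 180]   # cm (hypothetical values)
weight = [50, 56, 63, 68, 80]        # kg (hypothetical values)

n = len(height)
mean_h = sum(height) / n
mean_w = sum(weight) / n

cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(height, weight))
std_h = math.sqrt(sum((h - mean_h) ** 2 for h in height))
std_w = math.sqrt(sum((w - mean_w) ** 2 for w in weight))

r = cov / (std_h * std_w)
print(f"Pearson correlation r = {r:.2f}")  # close to +1 => strong positive correlation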
Cluster Analysis
Cluster refers to a group of similar kind of objects.


Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.
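A minimal sketch of clustering in Python is shown below, assuming scikit-learn is available (the points are made-up example data); k-means groups the points so that points in the same cluster are close to each other.

# Minimal sketch: grouping 2-D points into clusters with k-means.
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [1, 0.5],      # one group of nearby points
          [8, 8], [8.5, 9], [9, 8]]        # another, far-away group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster label assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster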
Summarization
Summarization is the generalization of data. A set of relevant data is summarized, which results in a smaller set that gives aggregated information about the data.
For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc.
Sequence Discovery
Sequence discovery, or sequential pattern mining, is a data mining technique that is used to find relevant and important patterns in sequential data.

Classification and Regression for Predictive Analysis


Classification
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
Classification derives a model to determine the class of an object based on its
attributes.
The model is derived based on the analysis of a set of training data (i.e., data objects
for which the class labels are known).
The model is used to predict the class label of objects for which the class label is unknown.
A classification model can be represented in various forms
 IF-THEN rules
 Decision tree
 Neural network
IF-THEN rules
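A classification rule has the form IF condition THEN conclusion, where the condition tests attribute values and the conclusion assigns a class label. For example, a rule learned from customer data might be: IF age = youth AND student = yes THEN buys_computer = yes.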

Decision Tree
A decision tree is a flowchart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions.


Decision trees can easily be converted to classification rules.

Figure 1.2 Decision Tree
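As an illustrative sketch only (not part of the original notes), the following induces a small decision tree with scikit-learn, which is assumed to be available; the toy training set encoding age and student status, and the buys_computer labels, are invented for the example.

# Minimal sketch: inducing a decision tree from a toy training set.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes: [age (0=youth, 1=middle-aged, 2=senior), student (0=no, 1=yes)]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]   # class label: buys_computer

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))  # the learned tree as rules

# The fitted model can now predict the class label of an unseen tuple.
print(tree.predict([[2, 1]]))   # e.g. a senior who is a student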


Neural network
A neural network, when used for classification, is typically a collection of neuron-like
processing units with weighted connections between the units.

Figure1.3 Neural Network

There are many other methods for constructing classification models, such as Naive
Bayesian classification, support vector machines, and k-nearest-neighbor classification.

Regression
Regression is learning a function which maps a data item to a real-valued prediction
variable.
Regression is used to predict missing or unavailable numerical data values rather than
(discrete) class labels. The term prediction refers to both numeric prediction and class label
prediction.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
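A minimal sketch of regression is shown below, assuming scikit-learn is available; the experience and salary figures are invented. The fitted function maps a data item (years of experience) to a real-valued prediction (salary).

# Minimal sketch: learning a function that maps a data item to a
# real-valued prediction variable (simple linear regression).
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [5], [8]]            # years of experience (hypothetical)
y = [30000, 35000, 41000, 52000, 70000]  # salary (hypothetical)

model = LinearRegression().fit(X, y)
print(model.predict([[4]]))   # predicted (numeric) value for 4 years of experience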


Time Series Analysis


Time series is a sequence of events where the next event is determined by one or more
of the preceding events.
Time series reflects the process being measured and there are certain components that
affect the behaviour of a process.
Time series analysis includes methods to analyze time-series data in order to extract
useful patterns, trends, rules and statistics.
Stock market prediction is an important application of time-series analysis.
Prediction
Prediction task predicts the possible values of missing or future data. Prediction
involves developing a model based on the available data and this model is used in predicting
future values of a new data set of interest.
For example, a model can predict the income of an employee based on education,
experience and other demographic factors like place of stay, gender etc.

Prediction analysis is also used in different areas, including medical diagnosis, fraud detection, etc.

1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

KDD (Knowledge Discovery in Databases) is a field of computer science, which


includes the tools and theories to help in extracting useful and previously unknown
information (i.e., knowledge) from large collections of digitized data.

KDD consists of several steps, and Data Mining is one of them. Data Mining is the
application of a specific algorithm to extract patterns from data. However, KDD and Data
Mining are used interchangeably.

1.2.1 What is KDD?

KDD is a computer science field specializing in extracting previously unknown and


interesting information from raw data.

KDD is the whole process of trying to make sense of data by developing appropriate
methods or techniques.

This is achieved by creating short reports, modeling the process of generating data,
and developing predictive models that can predict future cases.


Due to the exponential growth of data, especially in areas such as business, KDD has
become a very important process to convert this large wealth of data into business
intelligence, as manual extraction of patterns has become seemingly impossible in the past
few decades.

For example, it is currently used for various applications such as social network
analysis, fraud detection, science, investment, manufacturing, telecommunications, data
cleaning, sports, information retrieval, and marketing.

1.2.2 KDD Process Steps

Figure 1.4 Data mining as a step in the process of knowledge discovery

The knowledge discovery process includes an iterative sequence of the following


steps:

 Data cleaning: It is a phase to remove noise and inconsistent data from the collection.
 Data integration: a phase where multiple data sources may be combined.
 Data selection: a phase where data relevant to the analysis task are retrieved from the database.
 Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
 Data mining: it is an essential process where intelligent methods are applied to extract data patterns.
 Pattern evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given interestingness measures.


 Knowledge representation: is the final phase where visualization and knowledge


representation techniques are used to present mined knowledge to users. This essential
step uses visualization techniques to help users understand and interpret the data
mining results.

1.2.3 Difference between KDD and Data Mining


Although the two terms KDD and Data Mining are heavily used interchangeably, they
refer to two related yet slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a
step inside the KDD process, which deals with identifying patterns in data.
And Data Mining is only the application of a specific algorithm based on the overall
goal of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can
be refined, and new data can be integrated and transformed to get different and more
appropriate results.

1.3 WHAT KINDS OF DATA CAN BE MINED?

As a general technology, data mining can be applied to any kind of data as long as the
data are meaningful for a target application. The most basic forms of data for mining
applications are relational databases, object-relational databases and object-oriented
databases, data warehouse data, and transactional data. Data mining can also be applied to
other forms of data e.g., data streams, ordered/sequence data, graph or networked data, spatial
data, text data, multimedia data, and the WWW.

1.3.1 Relational Database

A Relational database is defined as the collection of data organized in tables with


rows and columns.

Briefly, a relational database is a collection of tables, each of which is assigned a unique name. Each table consists of columns and rows, where columns represent attributes and rows (records) represent tuples.

Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.


Figure Relational Database

Tables convey and share information, which facilitates data searchability, reporting,
and organization.

Database Schema

A database schema is the skeleton structure that represents the logical view of the
entire database. It defines how the data is organized and how the relations among them are
associated. It formulates all the constraints that are to be applied on the data.

A database schema defines its entities and the relationship among them. It contains a
descriptive detail of the database.

Example of Relational schema for a relational database

 customer (cust ID, name, address, age, occupation, annual income, credit
information, category, . . .)
 item (item ID, brand, category, type, price, place made, supplier, cost, . . .)
 employee (empl ID, name, category, group, salary, commission, . . .)

A database schema can be divided broadly into two categories such as,

 Physical schema in Relational databases is a schema which defines the structure


of tables.


 Logical schema in Relational databases is a schema which defines the


relationship among tables.

A semantic data model, such as an entity-relationship (ER) data model, is often


constructed for relational databases. An ER data model represents the database as a set of
entities and their relationships.

SQL

Relational data can be accessed by database queries written in a relational query


language called SQL.

Standard API of relational database is SQL.

The most commonly used query language for relational database is SQL, which
allows retrieval and manipulation of the data stored in the tables, as well as the calculation of
aggregate functions such as average, sum, min, max and count.

For instance, an SQL query to count the video items in each category would be:
SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;

Data mining algorithms using relational databases can be more versatile, since they
can take advantage of the structure inherent to relational databases. While data mining can
benefit from SQL for data selection, transformation and consolidation, it goes beyond what
SQL could provide, such as predicting, comparing, detecting deviations, etc.

Application: Data Mining, ROLAP model, etc.

Relational databases are one of the most commonly available and richest information
repositories, and thus they are a major data form in the study of data mining.

1.3.2 Data warehouse

A data warehouse is a repository of information collected from multiple sources,


stored under a unified schema, and usually residing at a single site.

Data warehouse is an integrated subject-oriented and time-variant repository of


information in support of management‘s decision-making process.

Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.

In other words, data from the different stores would be loaded, cleaned, transformed
and integrated together. To facilitate decision-making and multi-dimensional views, data
warehouses are usually modelled by a multi-dimensional data structure.

Characteristics of Data warehouse

 Subject-Oriented: A data warehouse is subject-oriented since it provides topic-wise


information rather than the overall processes of a business.
 Integrated: A data warehouse is developed by integrating data from varied sources into
a consistent format.
 Time-variant: The different data present in the data warehouse provides information for
a specific period.
 Non-volatile: Data once entered into a data warehouse must remain unchanged. All data
is read-only. Previous data is not erased when current data is entered.

Structure of data warehouse system

There are three types of data warehouse:

 Enterprise data warehouse (EDW) is a system for structuring and storing all of a company's business data for analytics querying and reporting.
 Data Mart is a smaller form of data warehouse, which serves some specific
needs on data analysis.
 A virtual warehouse is a set of views over an operational database for
efficient query processing.


DW Architecture

Key components of a data warehouse

A typical data warehouse has four main components:

 Central database
 ETL (extract, transform, load) tools
 Metadata
 Access tools

It involves extracting information from source system by using an ETL process and
then storing the information in a staging database. The daily changes also come to the staging
area.

Another ETL process is used to transform information from the staging area to populate the ODS (Operational Data Store).

Then ODS is used for supplying information via another ETL process to the data
warehouse which in turn feeds a number of data marts that generate the reports required by
management.

The data in a data warehouse are organized around major subjects to facilitate
decision making. The data are stored to provide information from a historical perspective and
are typically summarized.


Data Cube

A data warehouse is usually modelled by a multidimensional data structure, called a


data cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure such as count or sum.

A data cube provides a multidimensional view of data and allows the precomputation
and fast access of summarized data.

Example : A cube represented by,

Country x Degree x Semester

Figure 1.7 A multidimensional data cube

Data Cube Operations

A number of operations may be applied to data cubes. They are,

 Roll-up
 Drill-down
 Slice and dice
 Pivot

Roll-up
Roll-up is like zooming out on the data cube.

It is required when the user needs further abstraction or less detail.

This operation performs further aggregations on the data.


Drill-down

Drill-down is like zooming in on the data and is therefore the reverse of roll-up.

It is an appropriate operation when the user needs further details or when the user wants to
partition more finely or wants to focus on some particular values of certain dimensions.

Slice and dice


Slice and dice are operations for browsing the data in the cube.

A slice is a subset of the cube corresponding to a single value for one or more members of the
dimensions.

A dice operation is similar to slice but dicing does not involve reducing the number of
dimensions.

A dice is obtained by performing a selection on two or more dimensions.

Pivot or Rotate

The pivot operation is used when the user wishes to re-orient the view of the data cube.

It may involve swapping the rows and columns or moving one of the row dimensions into the
column dimension.

For example,

The cube consisting of dimension degree along the x-axis, country along the y-axis and
starting semester along z-axis or vertical axis.
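The OLAP operations above can be approximated on flat data with ordinary group-by style aggregation. The sketch below is an illustration only (not part of the syllabus text): it uses pandas, which is assumed to be available, on a tiny made-up fact table with the Country x Degree x Semester dimensions of the example.

# Minimal sketch: roll-up, slice and pivot on a tiny fact table.
import pandas as pd

facts = pd.DataFrame({
    "country":  ["India", "India", "USA", "USA"],
    "degree":   ["BCA",   "MCA",   "BCA", "MCA"],
    "semester": ["2010-1", "2010-1", "2010-2", "2010-2"],
    "students": [120, 80, 60, 40],          # invented measure values
})

# Roll-up: aggregate away the semester dimension (less detail).
print(facts.groupby(["country", "degree"])["students"].sum())

# Slice: fix a single value of one dimension.
print(facts[facts["degree"] == "BCA"])

# Pivot (rotate): re-orient the view with countries as rows, degrees as columns.
print(facts.pivot_table(values="students", index="country",
                        columns="degree", aggfunc="sum"))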

Although data warehouse tools help support data analysis, additional tools for data
mining are often needed for in-depth analysis.

Multidimensional data mining (also called exploratory multidimensional data mining)


performs data mining in multidimensional space in an OLAP style. That is, it allows the
exploration of multiple combinations of dimensions at varying levels of granularity in data
mining, and thus has greater potential for discovering interesting patterns representing
knowledge.


Benefits of data warehouse

 Provide a single version of truth about enterprise information.


 Speed up ad hoc reports and queries
 Improved data consistency
 Better business decisions
 Easier access to enterprise data for end-users
 Better documentation of data
 Reduced computer costs and higher productivity

1.3.3 Transactional Databases

A transaction database is a set of records representing transactions, each with a time


stamp, an identifier and a set of items.

A transactional database includes a file where each record defines a transaction. A transaction generally contains a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).

Associated with the transaction file there may also be descriptive data for the items. The table below represents a transaction database.

In general, each record in a transactional database captures a transaction, such as a


customer‘s purchase, a flight booking, or a user‘s clicks on a web page.

A transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the transaction.

Transaction ID    Items
100               Bread, Egg
200               Bread, Egg, Juice
300               Bread, Milk
400               Egg, Juice, Milk


A transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the salesperson or the
branch, and so on.

Transactions are usually stored in flat files or stored in two normalized transaction
tables, one for the transactions and one for the transaction items.

One typical data mining analysis on such data is the so-called market basket analysis
or association rules in which associations between items occurring together or in sequence
are studied.
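As a rough illustration of such an analysis (not part of the original notes), the Python sketch below counts how often each pair of items occurs together in the four transactions listed in the table above.

# Minimal sketch: counting item pairs that occur together in the
# transactions shown in the table above.
from itertools import combinations
from collections import Counter

transactions = {
    100: {"Bread", "Egg"},
    200: {"Bread", "Egg", "Juice"},
    300: {"Bread", "Milk"},
    400: {"Egg", "Juice", "Milk"},
}

pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)   # e.g. ('Bread', 'Egg') occurs together in 2 of the 4 transactions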

This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.

It is a highly flexible system where users can modify information without changing any sensitive information.

Follows ACID property of DBMS that is all four ACID qualities are enforced through
transactions: Atomicity, Consistency, Isolation, and Durability.

Key features of Transactional Database

 Data Accuracy
 Flexibility
 Speed
 Keeping track of operations

Application: Banking, Distributed systems, Object databases, etc.

1.3.4 Object Oriented Databases

An object-oriented database combines object-oriented programming concepts with relational database capabilities. There are various items created using object-oriented programming languages like C++ and Java that can be stored in relational databases, but object-oriented databases are better suited for such items.


An object-oriented database is organized around objects rather than actions, and data
rather than logic. For example, a multimedia record in a relational database can be a definable
data object, as opposed to an alphanumeric value.

Object-oriented databases are a type of database management system. Different


database management systems provide additional functionalities. Object-oriented databases
add the database functionality to object programming languages, creating more manageable
code bases.

Building blocks (Components) of an object-oriented database

Object-oriented databases contain the following foundational elements:

 Objects are the basic building block and an instance of a class, where the type is either
built-in or user-defined.

 Classes are a grouping of all objects with the same properties and behaviors. A class provides a schema or blueprint for objects, defining their behavior.

 Methods determine the behavior of a class.

 Attributes are the properties of an object, such as name, status, and create date.

 Pointers are addresses that facilitate in accessing elements of an object database and
establish relations between objects.

Figure Object-Oriented Database


Object-Oriented Programming Concepts

Object-oriented databases closely relate to object-oriented programming concepts.

The four main ideas of object-oriented programming are:

 Polymorphism

 Inheritance

 Encapsulation

 Abstraction

 Polymorphism is the capability of an object to take multiple forms. This ability allows
the same program code to work with different data types.

 Inheritance creates a hierarchical relationship between related classes while making


parts of code reusable.

 Encapsulation is the ability to group data and mechanisms into a single object to
provide access protection. Through this process, pieces of information and details of how
an object works are hidden, resulting in data and function security.

 Abstraction is the procedure of representing only the essential data features for the
needed functionality. The process selects vital information while unnecessary
information stays hidden. Abstraction helps reduce the complexity of modeled data and
allows reusability.
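The short Python sketch below is illustrative only (the class names and attributes are assumptions) and shows how inheritance, encapsulation and polymorphism look in code.

# Minimal sketch: inheritance, encapsulation and polymorphism with two toy classes.
class Media:                              # base class
    def __init__(self, name):
        self._name = name                 # leading underscore: state treated as encapsulated

    def describe(self):
        return f"Media object: {self._name}"

class Image(Media):                       # Image inherits from Media (hierarchical relationship)
    def __init__(self, name, width, height):
        super().__init__(name)
        self._width, self._height = width, height

    def describe(self):                   # polymorphism: same call, specialised behaviour
        return f"Image {self._name} ({self._width}x{self._height})"

for obj in (Media("report"), Image("logo", 64, 64)):
    print(obj.describe())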

Object-Oriented Database Examples

There are different kinds of implementations of object databases. Most contain the
following features:

 Query Language - Language to find objects and retrieve data from the database.

 Transparent Persistence - Ability to use an object-oriented programming language for


data manipulation.

 ACID Transactions - ACID transactions guarantee all transactions are complete without
conflicting changes.


 Database Caching - Creates a partial replica of the database. Allows access to a


database from program memory instead of a disk.

 Recovery - Disaster recovery in case of application or system failure.

1.3.5 Spatial Databases

Spatial Databases which store geographical information in the form of coordinates,


topology, lines, polygons, etc.

A spatial database saves a huge amount of space-related data, including maps,


preprocessed remote sensing or medical imaging records, and VLSI chip design data.

Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.

Characteristics of Spatial Database

A spatial database system has the following characteristics

 It is a database system

 It offers spatial data types (SDTs) in its data model and query language.

 It supports spatial data types in its implementation, providing at least spatial


indexing and efficient algorithms for spatial join.

Features of Spatial databases

Spatial databases have several features that distinguish them from relational
databases such as,

 They carry topological and/or distance information

 They are usually organized by sophisticated multidimensional spatial indexing structures that are accessed by spatial data access methods

 Require spatial reasoning


 Geometric computation

 Spatial knowledge representation techniques

Spatial data mining

Spatial data mining refers to the extraction of knowledge, spatial relationships, or


other interesting patterns not explicitly stored in spatial databases. Such mining demands the
unification of data mining with spatial database technologies.

It can be used for learning spatial records, discovering spatial relationships and
relationships among spatial and nonspatial records, constructing spatial knowledge bases,
reorganizing spatial databases, and optimizing spatial queries.

Applications

 Maps
 Global positioning
 Marketing
 Remote sensing
 Image database exploration
 Medical imaging
 Navigation
 Traffic control
 Environmental studies

1.3.6 Temporal Databases

A temporal database is generally understood as a database capable of supporting


storage and reasoning of time-based data.

Temporal databases store temporal data, i.e. data that is time dependent (time varying).

A Temporal Database is a database with built-in support for handling time


sensitive data. Usually, databases store information only about current state, and not about
past states.


For example in an employee database if the address or salary of a particular person


changes, the database gets updated, the old value is no longer there.

However for many applications, it is important to maintain the past or historical


values and the time at which the data was updated. That is, the knowledge of evolution is
required. That is where temporal databases are useful. It stores information about the past,
present and future.

Any data that is time dependent is called the temporal data and these are stored in
temporal databases.

Temporal Databases store information about states of the real world across time.
Temporal Database is a database with built-in support for handling data involving time. It
stores information relating to past, present and future time of all events.

Typical temporal database scenarios and applications include time-dependent/time-


varying economic data, such as:

 Share prices

 Exchange rates

 Interest rates

 Company profits

Temporal data are sequences of values of a primary data type, usually numerical values, and temporal data mining deals with gathering useful knowledge from such data.

Examples of Temporal Databases

 Healthcare Systems: Doctors need the patients‘ health history for proper diagnosis.
Information like the time a vaccination was given or the exact time when fever goes
high etc.
 Insurance Systems: Information about claims, accident history, time when policies
are in effect needs to be maintained.
 Reservation Systems: Date and time of all reservations is important.


Temporal Aspects

There are two different aspects of time in temporal databases.

 Valid Time: Time period during which a fact is true in real world, provided to the
system.
 Transaction Time: Time period during which a fact is stored in the database, based
on transaction serialization order and is the timestamp generated automatically by the
system.

Temporal Relation

Temporal Relation is one where each tuple has associated time; either valid time or
transaction time or both associated with it.

 Uni-Temporal Relations: Has one axis of time, either Valid


Time or Transaction Time.
 Bi-Temporal Relations: Has both axis of time – Valid time and Transaction
time. It includes Valid Start Time, Valid End Time, Transaction Start
Time, Transaction End Time.

Products Using Temporal Databases


The popular products that use temporal databases include:

 Oracle
 Microsoft SQL Server (which supports temporal tables)
 IBM DB2

Features of temporal databases

Temporal databases support managing and accessing temporal data by providing one
or more of the following features:

 A time period datatype, including the ability to represent time periods with no end
(infinity or forever)
 The ability to define valid and transaction time period attributes and bitemporal
relations
 System-maintained transaction time


 Temporal primary keys, including non-overlapping period constraints


 Temporal constraints, including non-overlapping uniqueness and referential
integrity
 Update and deletion of temporal records with automatic splitting and coalescing
of time periods
 Temporal queries at current time, time points in the past or future, or over
durations

1.3.7 Text And Multimedia Databases

Text databases

Text databases consist of huge collection of documents. These text documents are
collected from several sources such as news articles, books, digital libraries, e-mail messages,
web pages, etc. Due to increase in the amount of information, the text databases are growing
rapidly. In many of the text databases, the data is semi-structured.

For example, a document may contain a few structured fields, such as title, author, publishing_date, etc. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents.

Without knowing what could be in the documents, it is difficult to formulate effective


queries for analyzing and extracting useful information from the data. Users require tools to
compare the documents and rank their importance and relevance. Therefore, text mining has
become popular and an essential theme in data mining.

Information Retrieval

Information retrieval deals with the retrieval of information from a large number of text-based documents. Information retrieval systems and database systems handle different kinds of data, so some problems that are typical of database systems are usually not present in information retrieval systems.

Examples of information retrieval system include,

 Online Library catalogue system


 Online Document Management Systems
 Web Search Systems etc.

Information Filtering

The main problem in an information retrieval system is to locate relevant documents


in a document collection based on a user's query. This kind of user's query consists of some
keywords describing an information need.

In such search problems, the user takes an initiative to pull relevant information out
from a collection. This is appropriate when the user has ad-hoc information need, i.e., a short-
term need. But if the user has a long-term information need, then the retrieval system can also
take an initiative to push any newly arrived information item to the user.

This kind of access to information is called Information Filtering. And the


corresponding systems are known as Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval

We need to check the accuracy of a system when it retrieves a number of documents


on the basis of user's input. Let the set of documents relevant to a query be denoted as
{Relevant} and the set of retrieved document as {Retrieved}. The set of documents that are
relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}.
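Two basic measures built on these sets are precision and recall:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision is the percentage of retrieved documents that are in fact relevant to the query, while recall is the percentage of relevant documents that were actually retrieved. As a small illustrative sketch (the document identifiers are made up), these measures can be computed with ordinary set operations in Python:

# Minimal sketch: precision and recall from two document-ID sets.
relevant = {"d1", "d2", "d3", "d4"}     # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}          # documents returned by the system

hit = relevant & retrieved              # {Relevant} intersect {Retrieved}
precision = len(hit) / len(retrieved)   # 2/3
recall = len(hit) / len(relevant)       # 2/4

print(f"precision = {precision:.2f}, recall = {recall:.2f}")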

Multimedia Databases

Multimedia databases include video, images, audio and text media. They can be
stored on extended object-relational or object-oriented databases, or simply on a file system.
This data is stored in the form of multiple file types like .txt(text), .jpg(images),
.swf(videos), .mp3(audio) etc.

Multimedia is characterized by its high dimensionality, which makes data mining


even more challenging. Data mining from multimedia repositories may require computer


vision, computer graphics, image interpretation, and natural language processing


methodologies.

The multimedia database can be classified into three types. These types are:

 Static media
 Dynamic media
 Dimensional media

Contents of the Multimedia Database


The multimedia database stored the multimedia data and information related to it.
This is given in detail as follows

Media data
This is the multimedia data that is stored in the database such as images, videos,
audios, animation etc.

Media format data

The Media format data contains the formatting information related to the media data
such as sampling rate, frame rate, encoding scheme etc.

Media keyword data

This contains the keyword data related to the media in the database. For an image the
keyword data can be date and time of the image, description of the image etc.


Media feature data

The Media feature data describes the features of the media data. For an image, feature
data can be colours of the image, textures in the image etc.

Challenges of Multimedia Database

There are many challenges to implement a multimedia database. Some of these are:

 Multimedia databases contain data in a large variety of formats such as .txt (text), .jpg (images), .swf (videos), .mp3 (audio), etc. It is difficult to convert one type of data format to another.

 The multimedia database requires a large size as the multimedia data is quite large
and needs to be stored successfully in the database.

 It takes a lot of time to process multimedia data so multimedia database is slow.

Multimedia Database Applications:

 Documents and record management: Industries which keep a lot of documentation


and records. Ex: Insurance claim industry.
 Knowledge dissemination: Multimedia database is an extremely efficient tool for
knowledge dissemination and providing several resources. Ex: electronic books
 Education and training: Multimedia sources can be used to create resources useful
in education and training. These are popular sources of learning in recent days. Ex:
Digital libraries.
 Real-time monitoring and control: Multimedia presentation when coupled with
active database technology can be an effective means for controlling and monitoring
complex tasks. Ex: Manufacture control
 Marketing
 Advertisement
 Retailing
 Entertainment


1.3.8 Heterogeneous Databases


In a heterogeneous distributed database, different sites have different operating
systems, DBMS products and data models.
Its properties are
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in
processing user requests.
Types of Heterogeneous Distributed Databases

 Federated − The heterogeneous database systems are independent in nature and


integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.

1.4 MINING ISSUES


Data mining is a dynamic and fast-expanding field with great strengths.
Data mining is the practice of drawing solutions from data-based insights in the form
of patterns, models, or algorithms. Doing this is certainly not easy. It requires the joint effort
of data scientists, researchers, translators, and analysts to make it possible.

Data Mining Issues


The major issues in data mining research can be partitioned into five groups:
 Mining Methodology
 User Interaction
 Performance
 Different Data Types
 Data Security & Privacy
1. Mining Methodology Issues
Researchers have been dynamically developing new data mining methodologies. This
involves the investigation of new kinds of knowledge, mining in multidimensional space,
integrating methods from other disciplines.

In addition, mining methodologies should consider issues such as data uncertainty,


noise, and incompleteness. Some mining methods explore how user specified measures can
be used to assess the interestingness of discovered patterns as well as guide the discovery
process.
Various aspects of mining methodology
 Mining various and new kinds of knowledge
Data mining covers a wide spectrum of data analysis and knowledge discovery tasks,
association, correlation analysis, classification, regression, clustering, outlier analysis,
sequence analysis, and trend and evolution analysis. These tasks may use the same database
in different ways and require the development of numerous data mining techniques.
 Mining knowledge in multidimensional space
When searching for knowledge in large data sets, we can explore the data in
multidimensional space. That is, we can search for interesting patterns among combinations
of dimensions (attributes) at varying levels of abstraction. Such mining is known as
(exploratory) multidimensional data mining. Mining knowledge in multidimensional data
cube can substantially enhance the power and flexibility of data mining.
 Data mining - an interdisciplinary effort
The power of data mining can be substantially enhanced by integrating new methods
from multiple disciplines.
 Boosting the power of discovery in a networked environment
Most data objects reside in a linked or interconnected environment, whether it be the
Web, database relations, files, or documents. Semantic links across multiple data objects can
be used to advantage in data mining.
 Handling uncertainty, noise, or incompleteness of data
Data often contain noise, errors, exceptions, or uncertainty, or are incomplete. Errors
and noise may confuse the data mining process. Data cleaning, data preprocessing, outlier
detection and removal, and uncertainty reasoning are examples of techniques that need to be
integrated with the data mining process. Dealing with the noises, incomplete information, and
errors can be a big challenge.
 Pattern evaluation and pattern- or constraint-guided mining
Not all the patterns generated by data mining processes are interesting. Pattern interestingness may vary from user to user. Therefore, techniques are needed to assess the interestingness of discovered patterns based on subjective measures.


2. User Interaction Issues


The user plays an important role in the data mining process. Interesting areas of
research include,
 Interactive mining
The data mining process should be highly interactive. Thus, it is important to build
flexible user interfaces and an exploratory mining environment, facilitating the user‘s
interaction with the system. Interactive mining should allow users to dynamically change the
focus of a search, to refine mining requests based on returned results.
 Incorporation of background knowledge
The background knowledge can be used to guide discovery process and to express the
discovered patterns. Background knowledge may be used to express the discovered patterns
not only in concise terms but at multiple levels of abstraction.
 Ad hoc data mining and data mining query languages
Data Mining Query language that allows the user to describe ad hoc mining tasks,
should be integrated with a data warehouse query language and optimized for efficient and
flexible data mining.
 Presentation and visualization of data mining results
Once the patterns are discovered it needs to be expressed in high level languages, and
visual representations. These representations should be easily understandable.
3. Performance Issues
It is necessary that the modelling should be flexible and qualified through quality tests.
Here, the performance issues must be acknowledged.
 Efficiency and scalability of data mining algorithms
In order to effectively extract the information from huge amount of data in databases,
data mining algorithm must be efficient and scalable.
 Parallel, Distributed, and Incremental mining Algorithms
The factors such as huge size of databases, wide distribution of data, and complexity
of data mining methods motivate the development of parallel and distributed data mining
algorithms. The incremental algorithms, update databases without mining the data again from
scratch.
4. Diverse Data Types Issues
Data has many faces. We may find it in its visual, audio, text, and numeric forms.
Processing these different types and then, mining may be difficult.


 Dealing with Relational and Complex Types of Data


The database may contain complex data objects, multimedia data objects, spatial data,
temporal data etc. It is not possible for one system to mine all these kinds of data.

 Mining Information from Heterogeneous Databases


The data is available at different data sources on LAN or WAN. These data source
may be structured, semi structured or unstructured. Therefore, mining the knowledge from
them adds challenges to data mining.
5. Data Security & Privacy Issues
Personally identifiable data is sensitive and people don't like to share it with anyone. Here, a threat to its privacy and security is a reason for serious concern.

 Security
Mostly, data are shared over the internet, the cloud, and servers to ensure their access
24X7 remotely. This access can be dangerous if it is done through a public network, which is
not secure.

 Privacy Concerns
Dynamic techniques are adopted to collect information from diverse resources,
especially from data subjects. This collection is not risk-free, as they carry personally
identifiable information. Hackers tend to break in and take away these credentials.

1.5 DATA MINING METRICS

Data Mining has emerged at the confluence of artificial intelligence, statistics, and
databases as a technique for automatically discovering summary knowledge in large datasets.

Data mining metrics may be defined as a set of measurements which can help in determining the efficacy of a data mining method, technique or algorithm. They are important in taking the right decisions, such as choosing the right data mining technique or algorithm.

In many cases, a single metric may not be sufficient to evaluate. In such cases, we
might have multiple metrics which can be used to validate one another and maximize the
accuracy of the evaluation.


Data mining metrics generally fall into the categories of


 Accuracy
 Reliability
 Usefulness
Accuracy
Accuracy is a measure of how well the model correlates an outcome with the
attributes in the data that has been provided.
There are various measures of accuracy, but all measures of accuracy are dependent
on the data that is used.
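For a classification model, one common way to measure accuracy is the fraction of test records whose class label is predicted correctly. A minimal illustrative sketch (the labels below are invented):

# Minimal sketch: accuracy as the fraction of correct predictions.
actual    = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"accuracy = {accuracy:.2f}")   # 4 correct out of 5 -> 0.80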

Usefulness
Usefulness involves several metrics that tell us whether the model provides useful information. For instance, a data mining model that correlates store location with sales can be both accurate and reliable, yet still not be useful, because that result cannot be generalized by adding more stores at the same location.

Reliability
Reliability assesses the way that a data mining model performs on different data sets.
A data mining model is reliable if it generates the same type of predictions or finds the same
general kinds of patterns regardless of the test data that is supplied.

Return on Investment (ROI)


From an overall business or usefulness perspective, a measure such as Return on
Investment (ROI) could be used. ROI examines the difference between what the data mining
technique costs and what the savings or benefits from its use are. Of course, this would be
difficult to measure because the return is hard to quantify. It could be measured as increased
sales, reduced advertising expenditure, or both.
Evaluation metrics play a critical role in data mining. Metrics are used to guide the
data mining algorithms and to evaluate the results of data mining.

1.6 SOCIAL IMPLICATIONS OF DATA MINING

Data mining is the process of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies including statistical and mathematical techniques.


Data mining systems are designed to promote the identification and classification of individuals into different groups or segments. From the perspective of the commercial firm, and possibly for the industry as a whole, the use of data mining can be interpreted as a discriminatory technology in the rational pursuit of profits.

There are various social implications of data mining which are as follows

Privacy

In current years privacy concerns have taken on a more important role in society as
merchants, insurance companies, and government agencies amass warehouses including
personal records.

The concerns that people have over the collection of this data will generally extend to the analytic capabilities applied to the data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.

Profiling

Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify.

Unauthorized Use

Trends obtained through data mining, although intended to be used for marketing or other ethical goals, can be misused. Unethical businesses or people can use the information obtained through data mining to take advantage of vulnerable people or to discriminate against a specific group of people.


2. Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable


format. It is also an important step in data mining as we cannot work with raw data. The
quality of the data should be checked before applying machine learning or data mining
algorithms.

2.1 WHY PREPROCESS THE DATA

Data preprocessing is the key step for identifying missing values, inconsistencies, and noise, such as errors and outliers.
Without data preprocessing, these data errors would survive and lower the quality of data mining.
Data pre-processing is the preliminary step to clean the data, improve the data quality,
and also adapt better data mining techniques and tools.
Data have quality if they satisfy the requirements of the intended use. There are many
factors comprising data quality, including accuracy, completeness, consistency, timeliness,
believability, and interpretability.
We carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions to be included in the analysis.
Users of database systems have reported errors, unusual values, and inconsistencies in the data recorded for some transactions.
In other words, the data that we wish to analyze by data mining techniques are incomplete (lacking attribute values), inaccurate or noisy (containing errors, or values that deviate from the expected), and inconsistent.
Three of the elements defining data quality are,
 Accuracy
 Completeness
 Consistency
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.


Many possible reasons for inaccurate data (i.e., having incorrect attribute values)
 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data entry.
 Users may purposely submit incorrect data values for mandatory fields when
they do not wish to submit personal information. This is known as disguised
missing data.
 Errors in data transmission can also occur.
 There may be technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
Incomplete data can occur for a number of reasons
 Attributes of interest may not always be available
 Relevant data may not be recorded due to a misunderstanding
 Negligence
 Deliberate avoidance for privacy
 Ambiguity of the survey question
Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the data history or modifications may have been overlooked.
Missing data, particularly for tuples with missing values for some attributes, may need
to be inferred.
Preprocessing of data is mainly to check the data quality. The quality can be checked
by the following
 Accuracy: To check whether the data entered is correct or not.
 Completeness: To check whether the data is available or not recorded.
 Consistency: To check whether the same data is kept consistently in all the places where it appears.
 Timeliness: The data should be updated correctly.
 Believability: The data should be trustable.
 Interpretability: The understandability of the data.
2.1.1 Major Tasks in Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation


Figure 2.1 Data Preprocessing Steps

 Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
 Data integration involves integrating multiple databases, data cubes, or files.
 Data reduction (including dimensionality reduction) is a process that reduces the volume of the data, which makes the analysis easier.
 The change made in the format or the structure of the data is called data
transformation.

2.2 DATA CLEANING


Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets; it also replaces the missing values.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
Basic methods for data cleaning are,
 Handling missing values
 Data smoothing techniques
 Approaches to data cleaning as a process
2.2.1 Handling missing values
Missing values are filled with appropriate values. Standard values like ―Not
Available‖ or ―NA‖ can be used to replace the missing values.


Missing values can also be filled manually, but this is not recommended when the dataset is big.
The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
While using regression or decision tree algorithms the missing value can be replaced
by the most probable value.
Approaches to fill the values are,
 Ignore the tuple: The tuple is ignored when it includes several attributes with
missing values. This is usually done when the class label is missing. This method is
not very effective, unless the tuple contains several attributes with missing values.
 Fill in the missing value manually: The values are filled manually for the missing
value. In general, this approach is time consuming and may not be feasible given a
large data set with many missing values.
 Use a global constant to fill in the missing value: Replace all missing attribute
values by the same global constant such as a label like ―Unknown‖ or −∞.
 Use the attribute mean or median to fill in the missing value: The attribute mean
can fill the missing values, which indicate the ―middle‖ value of a data distribution.
For normal (symmetric) data distributions, the mean can be used, while skewed data
distribution should employ the median.
 Use the most probable value to fill in the missing value: The most probable value
can fill the missing values that may be determined with regression, inference-based
tools using a Bayesian formalism, or decision tree induction.
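As a rough illustration of the global-constant, mean, and median approaches, the Python sketch below uses the pandas library on a small made-up table; the column names and values are purely hypothetical.

```python
import numpy as np
import pandas as pd

# A small hypothetical data set with missing values (NaN / None).
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", "clerk"],
    "income": [52000, 48000, np.nan, 61000],
    "age": [34, 29, np.nan, 45],
})

# Global constant: replace missing nominal values with the label "Unknown".
df["occupation"] = df["occupation"].fillna("Unknown")

# Attribute mean: suitable for roughly symmetric (normal) distributions.
df["income"] = df["income"].fillna(df["income"].mean())

# Attribute median: preferred for skewed distributions.
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```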

2.2.2 Noisy Data


Noise is a random error or variance in a measured variable, which contains unnecessary data points.
Some basic statistical description techniques (e.g., boxplots and scatter plots), and
methods of data visualization can be used to identify outliers, which may represent noise.
Some smoothing methods to handle noise are as follows:
 Binning
 Regression
 Clustering or Outlier analysis


Binning
This method is used to smooth or handle noisy data. First, the data is sorted and then the sorted values are separated and stored in the form of bins.
These methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into multiple buckets or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
There are three methods for smoothing data in the bin.
 Smoothing by bin mean method: In this method, the values in the bin are
replaced by the mean value of the bin;
 Smoothing by bin median: In this method, the values in the bin are replaced
by the median value;
 Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

Figure 2.2 Binning methods for data smoothing
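The three bin-smoothing strategies can be sketched in plain Python as below; the price list and the bin size of 3 follow the common equal-depth example and are illustrative only.

```python
# Smoothing a sorted list of values by bin means and by bin boundaries.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of
# the bin's minimum and maximum values.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(by_means)        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```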


Regression
Data smoothing can also be done by regression, a technique that conforms data values
to a function.
This is used to smooth the data and will help to handle data when unnecessary data is
present. For the analysis, regression helps to decide the variable which is suitable for
analysis.
Linear regression involves finding the ―best‖ line to fit two attributes (or variables) so
that one attribute can be used to predict the other.

Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
Clustering or Outlier analysis
Clustering helps in identifying outliers. Similar values are organized into clusters, and values that fall outside the clusters are known as outliers. Clustering is generally used in unsupervised learning.
2.2.3 Data Cleaning as a Process
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors:
 Poorly designed data entry forms that have many optional fields
 Human error in data entry
 Deliberate errors (e.g., respondents not wanting to divulge information about
themselves),
 Data decay (e.g., outdated addresses).
 Discrepancies may also arise from inconsistent data representations and
inconsistent use of codes.
 Errors in instrumentation devices that record data and system errors.
Different tools for discrepancy detection
There are a number of different commercial tools that can aid in the discrepancy
detection step.
 Data scrubbing tools use simple domain knowledge to detect errors and make
corrections in the data. These tools rely on parsing and fuzzy matching techniques
when cleaning data from multiple sources.
 Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions.
Some data inconsistencies may be corrected manually using external references. Most
errors, however, will require data transformations.
Commercial tools that can assist in the data transformation step are:
 Data migration tools allow simple transformations to be specified such as to
replace the string ―gender‖ by ―sex.‖
 ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).


2.3 DATA INTEGRATION


The data integration process is one of the main components of data management.
Data integration is the process of merging data from multiple data sources. These sources may include multiple data cubes, databases, or flat files.

Figure 2.3 Data Integration

Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the subsequent data
mining process.
The data integration strategy is formally stated as a triple (G, S, M), where G represents the global schema, S represents the set of heterogeneous source schemas, and M represents the mapping between queries over the source and global schemas.

2.3.1 Data Integration Approaches


There are mainly two types of approaches for data integration. These are as follows:
 Tight Coupling
 Loose Coupling
Tight Coupling
It is the process of using ETL (Extraction, Transformation, and Loading) to
combine data from various sources into a single physical location.
Loose Coupling
This approach provides an interface that gets a query from the user, changes it into a
format that the supply database may understand, and then sends the query to the source
databases without delay to obtain the result.


2.3.2 Issues in Data Integration


When integrating data in data mining, we may face many issues, such as the following.
Entity Identification Problem
There are a number of issues to consider during data integration. Schema integration
and object matching can be tricky. How can equivalent real-world entities from multiple
data sources be matched up? This is referred to as the entity identification problem.
Matching real-world entities across data collected from heterogeneous sources is therefore difficult. Analyzing the metadata information of an attribute helps prevent errors in schema integration.
For example,
How can the data analyst or the computer be sure that customer id in one database and
customer number in another refer to the same attribute? Examples of metadata for each
attribute include the name, meaning, data type, and range of values permitted for the attribute,
and null rules for handling blank, zero, or null values. Such metadata can be used to help
avoid errors in schema integration. The metadata may also be used to help transform the data.
Ensuring that the functional dependency of an attribute in the source system and its
referential constraints match with the functional dependency and referential constraint of the
same attribute in the target system can achieve structural integration.
Redundancy and Correlation Analysis
One of the big issues during data integration is redundancy. These redundant and
unimportant data are no longer needed and can arise due to attributes that can be derived
using another attribute in the data set.
The level of redundancy can also be raised by the inconsistencies in attributes and can
be discovered using correlation analysis. Here, the attributes are analyzed to detect their
interdependence on each other, thus being able to detect the correlation between them.
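A minimal sketch of such a correlation check, assuming two numeric attributes stored as NumPy arrays; the height values are made up, and the 0.9 cutoff is an arbitrary illustration rather than a standard threshold.

```python
import numpy as np

# Two hypothetical numeric attributes from an integrated data set.
height_cm = np.array([150, 160, 165, 170, 180, 185], dtype=float)
height_in = np.array([59.1, 63.0, 65.0, 66.9, 70.9, 72.8])  # roughly cm / 2.54

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(height_cm, height_in)[0, 1]
print(f"correlation = {r:.3f}")

# A correlation near +1 or -1 flags the pair as candidates for redundancy removal.
if abs(r) > 0.9:
    print("Attributes look redundant; one of them can be dropped.")
```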
Tuple Duplication
Data integration also has to deal with duplicated tuples. These may become a part of
the resultant data if a denormalized table is used as the source for data integration.
Data Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts.
Data conflict happens when the data merged from various sources do not match. This
could be caused by varying attribute values in different data sets. It could also be caused by


different representations in different data sets. Issues such as this are meant to be detected and
resolved in data integration.

2.3.3 Data Integration Techniques


 Manual Integration
 Middleware Integration
 Application-based integration
 Uniform Access Integration
 Data Warehousing
Manual Integration
This method avoids using automation during data integration. The data analyst
collects, cleans, and integrates the data to produce meaningful information. This strategy is
suitable for a small organization with a limited data set. However, it is time-consuming for huge, sophisticated, and recurring integrations, because the entire process must be done manually.
Middleware Integration
The middleware software is used to take data from many sources, normalize it, and
store it in the resulting data set. When an enterprise needs to integrate data from legacy
systems to modern systems, this technique is used.
Application-based integration
It is using software applications to extract, transform, and load data from disparate
sources. This strategy saves time and effort, but it is a little more complicated.
Uniform Access Integration
This method combines data from more disparate sources. However, the data's position is not altered in this scenario; the data stays in its original location. This technique merely generates a unified view of the integrated data.
Data Warehousing
This technique is loosely related to the uniform access integration technique; the unified view, however, is stored in a different location. It enables the data analyst to deal with more sophisticated queries.


2.4 DATA TRANSFORMATION


In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
Strategies for data transformation include the following:
 Smoothing: Smoothing which works to remove noise from the dataset and helps in
knowing the important features of the dataset. By smoothing we can find even a
simple change that helps in prediction. Techniques include binning, regression, and
clustering.
 Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining process.
 Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated into a data analysis description. This is an important step, since the accuracy of the result depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
 Discretization: The continuous data here is split into intervals. Discretization reduces
the data size. For example, rather than specifying the class time, we can set an interval
like (3 pm-5 pm, 6 pm-8 pm).
 Normalization: It is the method of scaling the data so that it can be represented in a
smaller range. Example ranging from -1.0 to 1.0.
 Concept hierarchy generation for nominal data, where attributes such as street can
be generalized to higher-level concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema and can be automatically
defined at the schema definition level.
Data Transformation by Normalization
The data should be normalized or standardized to help avoid dependence on the
choice of measurement units. This involves transforming the data to fall within a smaller or
common range such as [−1, 1] or [0.0, 1.0].
The measurement unit used can affect the data analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for weight,
may lead to very different results.


Normalizing the data attempts to give all attributes an equal weight.


Normalization is particularly useful for classification algorithms involving neural
networks or distance measurements such as nearest-neighbor classification and clustering.
If using the neural network backpropagation algorithm for classification mining,
normalizing the input values for each attribute measured in the training tuples will help speed
up the learning phase.
For distance-based methods, normalization helps prevent attributes with initially
large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g.,
binary attributes). It is also useful when given no prior knowledge of the data.
There are many methods for data normalization. They are,
 min-max normalization
 z-score normalization
 normalization by decimal scaling
Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to v'i in the range [new_minA, new_maxA] by computing
v'i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of A is normalized to v'i by computing
v'i = (vi − Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of A.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value, vi, of A is normalized to v'i by computing
v'i = vi / 10^j
where j is the smallest integer such that max(|v'i|) < 1.
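The three normalization methods can be sketched with NumPy as follows; the income values and the target range are hypothetical.

```python
import numpy as np

income = np.array([12000, 35000, 58000, 73600, 98000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# z-score (zero-mean) normalization using the mean and standard deviation of A.
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that the
# largest absolute normalized value is below 1.
j = 0
while np.abs(income).max() / (10 ** j) >= 1:
    j += 1
decimal_scaled = income / (10 ** j)

print(minmax)
print(zscore)
print(decimal_scaled)
```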


2.5 DATA REDUCTION


Data mining is applied to the selected data in a large amount database. When data
analysis and mining is done on a huge amount of data, then it takes a very long time to
process, making it impractical and infeasible.
Data reduction techniques ensure the integrity of data while reducing the data.
Data reduction is a process that reduces the volume of original data and represents it
in a much smaller volume.
Data reduction techniques are used to obtain a reduced representation of the dataset
that is much smaller in volume by maintaining the integrity of the original data.
By reducing the data, the efficiency of the data mining process is improved, which
produces the same analytical results.
Data reduction does not affect the result obtained from data mining. That means the
result obtained from data mining before and after data reduction is the same or almost the
same.
Data reduction aims to define it more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally high-priced algorithms.
The reduction of the data may be in terms of the number of rows (records) or terms of
the number of columns (dimensions).

2.5.1 Overview of Data Reduction Strategies


 Dimensionality reduction
 Numerosity reduction
 Data compression

Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration, thereby reducing the volume of the original data. It reduces data size as it eliminates outdated or redundant features.
This process is necessary for real-world applications as the data size is big.
Combining and merging the attributes of the data without losing its original characteristics also helps to reduce the storage space and the computation time.


Here are three methods of dimensionality reduction


 Wavelet Transform
 Principal Component Analysis
 Attribute Subset Selection
Wavelet Transform
The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet
coefficients. The two vectors are of the same length.
How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data? The usefulness lies in the fact that the wavelet transformed data can be truncated.
Example
Transforming Y into Y’ helps us to reduce the data. This Y’ data can be trimmed or
truncated whereas the actual vector Y cannot be compressed.
A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.
In other words, the compressed data is obtained by retaining only a small fraction of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
Principal Component Analysis
Principal components analysis transforms or projects the original data onto a smaller space.
Principal component analysis, a technique for data reduction in data mining, groups the important variables into components that capture the maximum information present within the data and discards the other, less important variables.
Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis identifies k independent n-dimensional vectors, with k ≤ n, that can be used to represent the data set.
In short, PCA is applied to reducing multi-dimensional data into lower-dimensional
data. This is done by eliminating variables containing the same information as provided by
other variables and combining the relevant variables into components.
Principal component analysis can be applied to sparse and skewed data.


Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L,


method) searches for k n-dimensional orthogonal vectors that can best be used to represent
the data, where k ≤ n.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step
helps ensure that attributes with large domains will not dominate attributes with smaller
domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These vectors
are referred to as the principal components. The input data are a linear combination of the
principal components.
3. The principal components are sorted in order of decreasing ―significance‖ or strength. That
is, the sorted axes are such that the first axis shows the most variance among the data, the
second axis shows the next highest variance, and so on.
4. Because the components are sorted in decreasing order of ―significance,‖ the data size can
be reduced by eliminating the weaker components, that is, those with low variance. Using the
strongest principal components, it should be possible to reconstruct a good approximation of
the original data.
PCA can be applied to ordered and unordered attributes, and can handle sparse data
and skewed data.
Multidimensional data of more than two dimensions can be handled by reducing the
problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster
analysis.
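A compact sketch of the four-step PCA procedure using NumPy's eigendecomposition; the synthetic data, the choice of k = 2, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 100 tuples, 3 numeric attributes, two of them correlated.
x = rng.normal(size=(100, 1))
data = np.hstack([x, 2 * x + rng.normal(scale=0.1, size=(100, 1)),
                  rng.normal(size=(100, 1))])

# Step 1: normalize so each attribute falls within the same range (here z-scores).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Step 2: compute orthonormal principal components from the covariance matrix.
cov = np.cov(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing variance ("significance").
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep only the strongest k components to reduce the data.
k = 2
reduced = z @ eigvecs[:, :k]          # n x k reduced representation
approx = reduced @ eigvecs[:, :k].T   # approximate reconstruction

print("explained variance ratio:", eigvals[:k] / eigvals.sum())
print("reduced shape:", reduced.shape)
```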
Attribute Subset Selection
The large data set has many attributes some of which are irrelevant to data mining or
some are redundant. Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
The attribute subset selection reduces the volume of data by eliminating redundant
and irrelevant attributes so it ensures that we get a good subset of original attributes even
after eliminating the unwanted attributes.


The goal of attribute subset selection is to find a minimum set of attributes such that
the resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier to
understand.
An exhaustive search for the optimal subset of attributes can be prohibitively
expensive. Therefore, heuristic methods that explore a reduced search space are commonly
used for attribute subset selection. These methods are typically greedy. Such greedy methods
are effective in practice and may come close to estimating an optimal solution
The ―best‖ (and ―worst‖) attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one another.
Many other attribute evaluation measures can be used such as the information gain
measure used in building decision trees for classification.

Greedy (heuristic) methods for attribute subset selection


Basic heuristic methods of attribute subset selection include the techniques are,
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At
each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction. At each node, the algorithm chooses the ―best‖ attribute to partition the data into
individual classes.


Figure 2.4 Greedy (heuristic) methods for attribute subset selection
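A stepwise forward selection can be sketched as below; the tiny data set, the attribute names, and the simple purity score (standing in for a statistical-significance test or information gain) are all hypothetical.

```python
# Hypothetical training tuples: (attribute dict, class label).
data = [
    ({"age": "young", "income": "high", "student": "no"},  "no"),
    ({"age": "young", "income": "high", "student": "yes"}, "yes"),
    ({"age": "old",   "income": "low",  "student": "no"},  "no"),
    ({"age": "old",   "income": "low",  "student": "yes"}, "yes"),
    ({"age": "mid",   "income": "high", "student": "yes"}, "yes"),
]
attributes = ["age", "income", "student"]

def purity(subset):
    """Fraction of tuples whose class is the majority class within each group
    of identical attribute values (higher = better class separation)."""
    groups = {}
    for row, label in data:
        key = tuple(row[a] for a in subset)
        groups.setdefault(key, []).append(label)
    correct = sum(max(labels.count(c) for c in set(labels))
                  for labels in groups.values())
    return correct / len(data)

# Stepwise forward selection: start empty, greedily add the best attribute.
selected = []
best_score = purity(selected)
while len(selected) < len(attributes):
    candidates = [a for a in attributes if a not in selected]
    scores = {a: purity(selected + [a]) for a in candidates}
    best_attr = max(scores, key=scores.get)
    if scores[best_attr] <= best_score:   # stop when no attribute improves the score
        break
    selected.append(best_attr)
    best_score = scores[best_attr]

print("selected attributes:", selected, "score:", best_score)
```

Stepwise backward elimination works symmetrically: it starts from the full attribute set and repeatedly removes the attribute whose removal hurts the score least.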

Numerosity Reduction
Numerosity reduction techniques reduce the original data volume and represent it in a much smaller form.
In this method, the representation of the data is made smaller by reducing the volume.
This technique includes two types: parametric and non-parametric numerosity reduction.
Parametric
Parametric numerosity reduction incorporates storing only data parameters instead of
the original data.
One method of parametric numerosity reduction is the regression and log-linear
method.
Regression and Log-Linear
Linear regression models a relationship between the two attributes by modeling a
linear equation to the data set. Suppose we need to model a linear function between two
attributes.
y = wx +b
Here, y is the response attribute, and x is the predictor attribute. If we discuss in terms
of data mining, attribute x and attribute y are the numeric database attributes, whereas w and
b are regression coefficients.


Multiple linear regression lets the response variable y be modeled as a linear function of two or more predictor variables.
Log-linear model discovers the relation between two or more discrete attributes in the
database. Suppose we have a set of tuples presented in n-dimensional space. Then the log-
linear model is used to study the probability of each tuple in a multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
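A least-squares fit of y = wx + b with NumPy illustrates the parametric idea: only the two coefficients need to be stored in place of the raw points. The synthetic data and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)                            # predictor attribute
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)    # noisy response attribute

# Fit y = w*x + b by least squares; only w and b need to be stored.
w, b = np.polyfit(x, y, deg=1)
print(f"w = {w:.2f}, b = {b:.2f}")   # close to the true values 3 and 2

# The stored parameters can regenerate approximate y values on demand.
y_approx = w * x + b
```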
Non-Parametric
A non-parametric numerosity reduction technique does not assume any model. The non-parametric technique results in a more uniform reduction, irrespective of data size, but it may not achieve as high a volume of data reduction as the parametric technique.
Nonparametric methods for storing reduced representations of the data include,
 Histograms
 Clustering
 Sampling
 Data cube aggregation
Histogram
A histogram is a graph that represents frequency distribution which describes how
often a value appears in the data.
Histogram uses the binning method to represent an attribute's data distribution. It uses
a disjoint subset which we call bin or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes. It can effectively represent up to five attributes.

Figure 2.5 Histogram
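An equal-width histogram over a hypothetical price list can be computed with NumPy as a rough sketch; the values and bucket boundaries are arbitrary.

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 15, 15,
                   15, 18, 18, 18, 20, 20, 20, 21, 21, 25, 25, 25, 28, 30])

# Equal-width histogram with buckets of width 10 (the last bucket is closed).
counts, edges = np.histogram(prices, bins=[0, 10, 20, 30])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"bucket {lo}-{hi}: {c} values")
```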


Clustering
Clustering techniques groups similar objects from the data so that the objects in a
cluster are similar to each other, but they are dissimilar to objects in another cluster.
How similar the objects inside a cluster are can be calculated using a distance function. The more similar the objects in a cluster, the closer they appear within the cluster.
The quality of the cluster depends on the maximum distance between any two data
items in the cluster.
The cluster representation replaces the original data. This technique is more effective if the present data can be classified into distinct clusters.
Sampling
One of the methods used for data reduction is sampling, as it can reduce the large data
set into a much smaller data sample.
Sampling is capable of reducing a large data set into a smaller sample data set that serves as a representation of the original data set.
There are four types of sampling data reduction methods. They are,
 Simple Random Sample Without Replacement of sizes
 Simple Random Sample with Replacement of sizes
 Cluster Sample
 Stratified Sample
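Simple random sampling with and without replacement can be sketched with NumPy as follows; the transaction IDs and the sample size s are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
tids = np.arange(1, 101)   # 100 hypothetical transaction IDs
s = 10                     # desired sample size

# Simple random sample without replacement (SRSWOR) of size s.
srswor = rng.choice(tids, size=s, replace=False)

# Simple random sample with replacement (SRSWR) of size s.
srswr = rng.choice(tids, size=s, replace=True)

print("SRSWOR:", srswor)
print("SRSWR :", srswr)
```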
Data cube aggregation
The data cube aggregation is a multidimensional aggregation which eases
multidimensional analysis.
As shown in the figure below, the data cube represents the annual sales of each item for each branch. The data cube holds precomputed and summarized data, which makes data mining fast to access.

Figure 2.6 Data cube aggregation


Data Compression
Data compression is a technique where a data transformation is applied to the original data in order to obtain compressed data.
The data compression technique reduces the size of the files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding).
We can divide it into two types based on their compression techniques.
 Lossless Compression
Encoding techniques (such as run-length encoding) allow a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
 Lossy Compression
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this compression. For example, the JPEG image format uses lossy compression, but we can still find meaning equivalent to the original image. In lossy data compression, the decompressed data may differ from the original data but are useful enough to retrieve information from them.
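A minimal run-length encoding sketch in Python illustrating the lossless case: the decoded output is identical to the input. The example string is arbitrary.

```python
from itertools import groupby

def rle_encode(text):
    # Replace each run of identical characters by (character, run length).
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    # Expand each (character, run length) pair back to the original run.
    return "".join(ch * n for ch, n in pairs)

data = "AAAABBBCCDAA"
encoded = rle_encode(data)
print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == data  # lossless: the original data is fully restored
```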

2.6 DATA DISCRETIZATION


Data discretization transforms numeric data by mapping values to interval or concept
labels. Such methods can be used to automatically generate concept hierarchies for the data,
which allows for mining at multiple levels of granularity.
Techniques of data discretization are used to divide the attributes of the continuous
nature into data with intervals. We replace many constant values of the attributes by labels of
small intervals. This means that mining results are shown in a concise, and easily
understandable way.
Discretization techniques include binning, histogram analysis, cluster analysis,
decision tree analysis, and correlation analysis.
2.6.1 Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. These
methods are also used as discretization methods for data reduction and concept hierarchy
generation.


For example, attribute values can be discretized by applying equal-width or equal-


frequency binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively.
These techniques can be applied recursively to the resulting partitions to generate
concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization
technique. It is sensitive to the user-specified number of bins, as well as the presence of
outliers.
Top-down discretization is also known as splitting.
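Equal-width and equal-frequency binning can be sketched with pandas (`cut` and `qcut`); the age values and the choice of three bins are assumptions made for illustration.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: the value range is split into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: each interval holds (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```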
2.6.2 Discretization by Histogram
Histogram analysis is an unsupervised discretization technique because it does not use
class information. A histogram partitions the values of an attribute, A, into disjoint ranges
called buckets or bins.
In an equal-width histogram, for example, the values are partitioned into equal-size
partitions or ranges.
With an equal-frequency histogram, the values are partitioned so that, ideally, each
partition contains the same number of data tuples.
The histogram analysis algorithm can be applied recursively to each partition in order
to automatically generate a multilevel concept hierarchy, with the procedure terminating once
a prespecified number of concept levels has been reached.
2.6.3 Discretization by Cluster, Decision Tree, and Correlation Analyses
Cluster analysis is a popular data discretization method. A clustering algorithm can
be applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or
groups. Clustering takes the distribution of A into consideration, as well as the closeness of
data points, and therefore is able to produce high-quality discretization results.
Clustering can be used to generate a concept hierarchy for A by following either a
top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a
node of the concept hierarchy.
In the former, each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy.
In the latter, clusters are formed by repeatedly grouping neighboring clusters in order
to form higher-level concepts.


Techniques to generate decision trees for classification can be applied to


discretization. Such techniques employ a top-down splitting approach.
Decision tree approaches to discretization are supervised, that is, they make use of
class label information.
Intuitively, the main idea is to select split-points so that a given resulting partition
contains as many tuples of the same class as possible. Entropy is the most commonly used
measure for this purpose.
To discretize a numeric attribute, A, the method selects the value of A that has the
minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at
a hierarchical discretization. Such discretization forms a concept hierarchy for A.
Measures of correlation can also be used for discretization. ChiMerge is a χ2-based discretization method. Unlike the methods described so far, which use a top-down splitting strategy, ChiMerge employs a bottom-up approach by finding the best neighboring intervals and then merging them to form larger intervals, recursively.


3. Data Mining Techniques

3.1 ASSOCIATION RULE MINING

Association rule mining is a technique that analyzes a set of transactions, each transaction being a list of products or items purchased by one customer.
The aim of association rule mining is to determine which items are purchased together frequently.
The term lift is used to measure the power of association between items that are
purchased together.
Association rule mining searches for interesting relationship among items in a given
data set.
A simple example is analyzing a large database of supermarket transaction with the
aim of finding association rules.
3.1.1 Basic Concepts
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that
appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear frequently together in
a transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences.
If a substructure occurs frequently, it is called a (frequent) structured pattern.
Finding frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data.
Assume that the shop sells only a small variety of products,
Bread, Egg, Coffee, Juice, Milk, Tea, Biscuit, Cheese, and Sugar.
We assume that the shopkeeper keeps records of what each customer purchases,


Records of ten customers are,

Transaction ID Items
10 Bread, Egg, cheese
20 Bread, Egg, Juice
30 Bread, Milk
40 Egg, Juice, Milk, Coffee
50 Sugar, Tea, Coffee, Biscuit, Cheese
60 Sugar, Tea, Coffee, Biscuits, Milk, Juice, Cheese
70 Bread, Egg
80 Bread, Egg, Juice, Coffee
90 Bread, Milk
100 Sugar, Tea, Coffee, Bread, Milk, Juice, Cheese

Each row in the table gives the set of items that one customer bought.
The shopkeeper wants to find which products (items) are sold together frequently.
Let I = { i1, i2, …, im } be a set of items
We assume that there are N transactions, and we denote them by
T = { t1, t2, …, tN }
Each transaction is associated with an identifier called TID (Transaction ID).
Let A be a set of items.
A transaction T is said to contain A if and only if A ⊆ T.
An association rule is an implication of the form A ⇒ B,
where A ⊂ I, B ⊂ I and A ∩ B = ∅.
3.1.2 Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations
among items in large transactional or relational data sets.
With massive amounts of data continuously being collected and stored, many
industries are becoming interested in mining such patterns from their databases.
The discovery of interesting correlation relationships among huge amounts of
business transaction records can help in many business decision-making processes such as
catalog design, cross-marketing, and customer shopping behavior analysis.
Market Basket Analysis is an example for Association Rule Mining.


This process analyzes customer buying habits by finding associations between the
different items that customers place in their ―shopping baskets‖.

Figure 3.1 Market basket analyses

The discovery of these associations can help retailers develop marketing strategies by
gaining insight into which items are frequently purchased together by customers.
Each basket can then be represented by a Boolean vector of values assigned to these
variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together. These patterns can be represented in the form of
association rules.
Association rules
Association rules are written as X ⇒ Y, meaning that whenever X appears, Y also tends to appear.
X is referred to as the rule's antecedent and Y as the consequent.
Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules.
Suppose items X and Y appear together in only 10% of the transactions but whenever
X appears there is an 80% chance that Y also appears.
The 10% presence of X and Y together is called the Support (or prevalence) of the
rule and the 80% chance is called Confidence (or predictability) of the rule.


i.e.,
Support(X ⇒ Y) = P(X ∪ Y)
sup(X) = count(X)/N
sup(X ∪ Y) = count(X ∪ Y)/N
Confidence(X ⇒ Y) = P(Y | X) = Support(X ∪ Y)/Support(X)
Typically, association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold. These thresholds can be a set by
users or domain experts.
Rules that satisfy both a minimum support threshold (min sup) and a minimum
confidence threshold (min conf ) are called strong.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset.
Frequency or support count
The number of transactions that contain an itemset is known as the frequency, support count, or count of the itemset.
Sometimes the term lift is used to measure the power of association between items
that are purchased together.
Lift may be defined as P(Y | X) / P(Y).
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition,
these rules must satisfy minimum support and minimum confidence.
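A short Python sketch of these measures over the ten-transaction shop table shown earlier; the candidate rule Bread ⇒ Milk is chosen arbitrarily for illustration.

```python
# Transactions from the ten-customer example above.
transactions = [
    {"Bread", "Egg", "Cheese"},
    {"Bread", "Egg", "Juice"},
    {"Bread", "Milk"},
    {"Egg", "Juice", "Milk", "Coffee"},
    {"Sugar", "Tea", "Coffee", "Biscuit", "Cheese"},
    {"Sugar", "Tea", "Coffee", "Biscuit", "Milk", "Juice", "Cheese"},
    {"Bread", "Egg"},
    {"Bread", "Egg", "Juice", "Coffee"},
    {"Bread", "Milk"},
    {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Cheese"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Bread"}, {"Milk"}
sup_rule = support(X | Y)                 # support of the rule X => Y
conf_rule = support(X | Y) / support(X)   # confidence of the rule X => Y
lift = conf_rule / support(Y)             # lift of the rule X => Y

print(f"support = {sup_rule:.0%}, confidence = {conf_rule:.0%}, lift = {lift:.2f}")
```

Running it prints a support of 30%, a confidence of about 43%, and a lift of about 0.86 for Bread ⇒ Milk.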
3.1.3 Applications of Association rule mining
Applications of Association rule mining are,
 Marketing
 Customer segmentation
 Medicine
 E-commerce
 Classification
 Web mining
 Bioinformatics
 finance


3.2 THE APRIORI ALGORITHM


3.2.1 Introduction
Apriori is the basic algorithm for finding frequent itemsets for Boolean association
rules.
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for
mining frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets
are used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
3.2.2 Parts of Apriori algorithm
This algorithm consists of two parts:
 Those itemsets that exceed the minimum support requirement are found such
itemsets are called frequent itemsets.
 The association rules that meet the minimum confidence requirement are found
from the frequent itemsets.
3.2.3 Notation used in Apriori algorithm
The main notation for association rule mining that is used in the Apriori algorithm is,
 A k-itemset is a set of k items
 The set Ck is a set of candidate k-itemsets that are potentially frequent.
 The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.
3.2.4 Algorithmic aspects of the Apriori algorithm
 Computing L1
 Apriori-gen function
 Pruning
 Apriori subset function


Computing L1
The database is scanned only once to obtain L1. During this single scan, the count for each item is found and L1 is determined.

Apriori-gen function
It takes an argument Lk-1 and returns a set of all candidate k-itemsets.
In computing C3 from L2
If an itemset (a, b, c) is to be in C3, then L2 must contain the itemsets (a, b) and (a, c), since all subsets of a frequent itemset must themselves be frequent.

Pruning
Once a candidate set Ci has been produced, we can prune some of the candidate
itemsets by checking all subsets of every itemset in the set are frequent.
Example,
If we have derived {a, b, c} from {a, b} and {a, c}, then we check that {b, c} is also in L2; if it is not, {a, b, c} may be removed from C3.

Apriori subset function


The candidate itemsets Ck are stored in a hash tree to improve the efficiency of
searching.
The leaves of the hash tree store itemsets while the internal nodes provide a roadmap
to reach the leaves.
Each leaf node is reached by traversing the tree whose root is at depth 1.
Apriori iteratively computes frequent itemsets Lk+1 based on Lk
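A compact level-wise Apriori sketch in Python following this notation (Ck candidates, Lk frequent k-itemsets); it uses a plain dictionary count instead of the hash-tree subset function, and the minimum support is given as a transaction count.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    items = {item for t in transactions for item in t}
    # L1: frequent 1-itemsets obtained from a single scan of the database.
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_count}
    frequent = {s: sum(s <= t for t in transactions) for s in Lk}
    k = 2
    while Lk:
        # Apriori-gen: join L(k-1) with itself to form candidate k-itemsets Ck ...
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # ... then prune candidates that have any infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# The five-transaction example of the next section: 50% support means a count of at least 3.
transactions = [
    {"Bread", "Cheese", "Egg", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
for itemset, count in apriori(transactions, 3).items():
    print(sorted(itemset), count)
```

On that five-transaction data set it reproduces L1 = {Bread, Cheese, Juice, Milk} and L2 = {{Bread, Juice}, {Cheese, Juice}}, matching the hand computation worked out below.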


3.2.5 EXAMPLE
Assume an example of only five transactions and six items.
We want to find association rule with 50% support and 75% confidence.
Transactions
TID Items
100 Bread, Cheese, Egg, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Butter
400 Bread, Juice, Milk
500 Cheese, Juice, Milk

First we can find L1, Here,


Bread appears 4 times
Cheese appears 3 times
Egg appears 1 time
Juice appears 4 times
Milk appears 3 times
Butter appears 1 time
We require 50% support and therefore each frequent item must appear in at least three
transactions
Frequent item L1 is
Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3

The candidate 2-itemsets or C2 has six pairs, these pairs and their frequencies are,
Candidate item pairs C2
Item pairs Frequency
(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2


 We have only two frequent item pairs: {Bread, Juice} and {Cheese, Juice}. This is L2.
From these two frequent 2-itemsets, we do not obtain a candidate 3-itemset, since we do not have two 2-itemsets that share the same first item.
The two frequent 2-itemsets give four possible rules:
Bread ⇒ Juice
Juice ⇒ Bread
Cheese ⇒ Juice
Juice ⇒ Cheese
The confidence of these rules is obtained by dividing the support for both items in the
rule by the support for the item on the left-hand side of the rule.
The confidences of the four rules are,
Bread ⇒ Juice = sup(X ∪ Y)/sup(X) = 3/4 = 75%
Juice ⇒ Bread = 3/4 = 75%
Cheese ⇒ Juice = 3/3 = 100%
Juice ⇒ Cheese = 3/4 = 75%
Since all of them have a minimum 75% of confidence, they all qualify.

3.2.6 Improving the Efficiency of Apriori Algorithm


Some approaches to improve Apriori‘s efficiency:
 Partitioning: Any itemset that is potentially frequent in a transaction database must
be frequent in at least one of the partitions of the transaction database.
 Sampling: This extracts a subset of the data with a lower support threshold and uses
the subset to perform association rule mining.
 Transaction reduction: A transaction that does not contain frequent k-itemsets is
useless in subsequent scans and therefore can be ignored.
 Hash-based itemset counting: If the corresponding hashing bucket count of a k-
itemset is below a certain threshold, the k-itemset cannot be frequent.
 Dynamic itemset counting: Only add new candidate itemsets when all of their
subsets are estimated to be frequent.


3.3 MULTILEVEL ASSOCIATION RULES


Multilevel associations involve concepts at different abstraction levels.
Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
For many applications, strong associations discovered at high abstraction levels,
though with high support, could be commonsense knowledge. We may want to drill down to
find novel patterns at more detailed levels.
On the other hand, there could be too many scattered patterns at low or primitive
abstraction levels, some of which are trivial specializations of patterns at higher levels.
Therefore, it is interesting to examine how to develop effective methods for mining patterns
at multiple abstraction levels, with sufficient flexibility for easy traversal among different
abstraction spaces.
Approaches to multilevel association rule mining
 Uniform Support(Using uniform minimum support for all level)
 Reduced Support (Using reduced minimum support at lower levels)
 Group-based Support(Using item or group based support)
Uniform Support
The same minimum support threshold is used when mining at each abstraction level.

Figure 3.2 Multilevel mining with uniform support


For example, a minimum support threshold of 5% is used throughout (e.g., for mining
from ―computer‖ downward to ―laptop computer‖). Both ―computer‖ and ―laptop computer‖
are found to be frequent, whereas ―desktop computer‖ is not.


When a uniform minimum support threshold is used, the search procedure is


simplified.
An optimization technique can be adopted, based on the knowledge that an ancestor is
a superset of its descendants: the search avoids examining itemsets containing any item
whose ancestor does not have minimum support.
The main drawback of the uniform support approach is that items at lower levels of abstraction will not occur as frequently as those at higher levels of abstraction.

Reduce Support

Each level of abstraction has its own minimum support threshold. The deeper the
abstraction level, the smaller the corresponding threshold.

Figure 3.3 Multilevel mining with reduced support

For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, ―computer,‖ ―laptop computer,‖ and ―desktop computer‖ are all
considered frequent.
Search categories for mining multiple-level associations with reduced support are:

 Level-by-level independent − This is a full-breadth search; no background knowledge of frequent itemsets is used for pruning. Each node is examined regardless of whether its parent node is found to be frequent.
 Level-cross-filtering by single item − An item at the i-th level is examined if and only if its parent node at the (i−1)-th level is frequent.
 Level-cross-filtering by k-itemset − A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i−1)-th level is frequent.


Group-based support
The group-wise threshold value for support and confidence is input by the user or
expert. The group is selected based on a product price or item set because often expert has
insight as to which groups are more important than others.
Example
For example, experts may be interested in the purchase patterns of laptops or clothes in the electronics and non-electronics categories. Therefore, a low support threshold is set for these groups to give attention to these items' purchase patterns.

3.4 MULTIDIMENSIONAL ASSOCIATION RULES


Multidimensional associations involve more than one dimension or predicate.
Association rules with two or more dimensions or predicates can be referred to as
multidimensional association rules.
For example,
Age (X, "20...29") ^occupation (X,"Student") =>buys (X,"Laptop")
This rule contains three predicates (age, occupation, and buys), each of which occurs
only once in the rule.
Multidimensional association rules with no repeated predicates are called
interdimensional association rules.
The rules with repeated predicates or containing multiple occurrences of some
predicates are called hybrid-dimension association rules.
For example,
Age (X, "20...29") ^buys (X,"Laptop") =>buys (X,"Printer")
The database attributes should be categorical or quantitative.
Categorical attributes have a finite number of possible values, with no ordering among the values; they are also called nominal attributes.
Quantitative attributes are numeric and have an implicit ordering among the values.
The three basic approaches regarding the treatment of quantitative attributes are as follows
 In the first approach, quantitative attributes are discretized using a predefined concept
hierarchy, which occurs before mining. The discretized numeric attributes with their
range values can be considered as categorical attributes.


 In the second approach, quantitative attributes are discretized into bins based on the distribution of the data. These bins can be further combined during the mining process. Therefore, the discretization process is dynamic.
 In the third approach, quantitative attributes are discretized so as to capture the semantic meaning of the interval data. This approach to discretization considers the distance between data points.

3.5 CONSTRAINT BASED ASSOCIATION MINING


Concept
Metarule-Guided Rule Mining
Constraint Pushing
Types of Rule Constraints
Concept
A data mining process may uncover thousands of rules from a given data set, most of
which end up being unrelated or uninteresting to users. Often, users have a good sense of
which ―direction‖ of mining may lead to interesting patterns and the ―form‖ of the patterns or
rules they want to find.
They may also have a sense of ―conditions‖ for the rules, which would eliminate the
discovery of certain rules that they know would not be of interest. Thus, a good heuristic is to
have the users specify such intuition or expectations as constraints to confine the search
space. This strategy is known as constraint-based mining.
Constraint-based algorithms use constraints to decrease the search space in the frequent itemset generation step (the association rule generation step is the same as that of exhaustive algorithms).
The most general constraint is the minimum support threshold. Including a constraint in the mining phase can significantly reduce the exploration space, because the constraint defines a boundary inside the search space lattice beyond which exploration is not needed.
Forms of Constraints
The constraints can include the following:
 Knowledge type constraints: These specify the type of knowledge to be mined, such
as association, correlation, classification, or clustering.
 Data constraints: These specify the set of task-relevant data.


 Dimension/level constraints: These specify the desired dimensions (or attributes) of


the data, the abstraction levels, or the level of the concept hierarchies to be used in
mining.
 Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness such as support, confidence, and correlation.
 Rule constraints: These specify the form of the rules to be mined. Such constraints
may be expressed as metarules (rule templates), as the maximum or minimum number
of predicates that can occur in the rule antecedent or consequent, or as relationships
among attributes, attribute values, and/or aggregates.
These constraints can be specified using a high-level declarative data mining query
language and user interface.
Constraint-based mining allows users to describe the rules that they would like to
uncover, thereby making the data mining process more effective.
Constraint-based mining encourages interactive exploratory mining and analysis.
Metarule-Guided Mining of Association Rules
Metarules allow users to specify the syntactic form of rules that they are interested in
mining. The rule forms can be used as constraints to help improve the efficiency of the
mining process.
Metarules may be based on the analyst‘s experience, expectations, or intuition
regarding the data or may be automatically generated based on the database schema.
Example: Metarule-guided mining
A metarule can be used to specify the form of the rules the user is interested in finding.
An example of such a metarule is
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")
where P1 and P2 are predicate variables that are instantiated to attributes from the
given database during the mining process, X is a variable representing a customer, and Y and
W take on values of the attributes assigned to P1 and P2, respectively. Typically, a user will
specify a list of attributes to be considered for instantiation with P1 and P2. Otherwise, a
default set may be used.


Constraint Pushing: Mining guided by Rule Constraints


Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initialization of variables, and constraints on aggregate functions and other forms of constraints.
Users typically employ their knowledge of the application or data to specify rule
constraints for the mining task.
These rule constraints may be used together with, or as an alternative to, metarule-
guided mining.
Rule constraints are used to mine hybrid-dimensional association rules.
Types of Rule Constraints
Rule constraints can be categorized as,
 antimonotonic
 monotonic
 succinct
 convertible
 inconvertible
antimonotonic
If an itemset does not satisfy this rule constraint, then none of its supersets can satisfy
the constraint.
Pruning by antimonotonic constraints can be applied at each iteration of Apriori-style
algorithms to help improve the efficiency.
Example:
min(J.price) ≥ $50
count(I) ≤ 10
monotonic
If an itemset I satisfies the rule constraint, then all of its supersets can satisfy the
constraint.
Example:
sum(I.price) > $100
succinct
All and only those sets that are guaranteed to satisfy the rule can be enumerated.
The itemsets can be directly generated that satisfy the rule, even before support
counting begins.


Example
min(J.price) >50
max(S) < 120
convertible
Constraints that are neither antimonotonic, monotonic, nor succinct can sometimes be converted into one of these forms.
If the items in the itemset are arranged in a particular order, the constraint may
become monotonic or antimonotonic with regard to the frequent itemset mining process.
Example
avg(I.price) ≤ 10
inconvertible
Constraints which are not convertible.
Example
Sum(S) < v , sum(S) > v
Elements of S could be any real value.
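As a rough illustration of how such constraints can be pushed into the mining process, the following Python sketch prunes candidate itemsets with an antimonotonic constraint (sum(I.price) <= budget) before support counting; the item prices and the budget value are invented for illustration.

# Antimonotone constraint: once the sum of prices exceeds the budget,
# no superset can satisfy it, so the candidate can be pruned early.
prices = {'milk': 30, 'bread': 20, 'camera': 150, 'laptop': 900}

def satisfies_antimonotone(itemset, budget=100):
    return sum(prices[item] for item in itemset) <= budget

def prune_candidates(candidates, budget=100):
    # Drop candidates that already violate the constraint.
    return [c for c in candidates if satisfies_antimonotone(c, budget)]

candidates = [{'milk', 'bread'}, {'milk', 'camera'}, {'camera', 'laptop'}]
print(prune_candidates(candidates))   # only {'milk', 'bread'} survives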


4. Classification and Prediction

There are two forms of data analysis that can be used to extract models describing
important classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications
as either safe or risky. Such analysis can help provide us with a better understanding of the
data at large.
Many classification methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Recent research has focused on developing scalable classification and prediction techniques capable of handling large amounts of disk-resident data.
Classification has numerous applications, including
 Fraud detection
 Target marketing
 Performance prediction
 Manufacturing
 Medical diagnosis
Multiclass classification
Classification is the process of classifying a record. One simple example of
classification is to check whether it is raining or not. The answer can either be yes or no. So,
there are a particular number of choices. Sometimes there can be more than two classes to
classify. That is called multiclass classification.
Prediction
Another process of data analysis is prediction. It is used to find a numerical output.
Same as in classification, the training dataset contains the inputs and corresponding
numerical output values.
Unlike in classification, this method does not have a class label. The model predicts a
continuous-valued function or ordered value.


Regression is generally used for prediction. Predicting the value of a house depending
on the facts such as the number of rooms, the total area, etc., is an example for prediction.

4.1 ISSUES REGARDING CLASSIFICATION AND PREDICTION


The major issue is preparing the data for Classification and Prediction. Preparing the
data involves the following activities, such as:

Data cleaning
Data cleaning involves removing the noise and treatment of missing values. The noise
is removed by applying smoothing techniques, and the problem of missing values is solved
by replacing a missing value with the most commonly occurring value for that attribute or
with the most probable value based on statistics.
Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
Relevance analysis
Many of the attributes in the data may be redundant. Correlation analysis can be used
to identify whether any two given attributes are statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that
one of the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be
used in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained


using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute
subset selection, can be used to detect attributes that do not contribute to the classification or
prediction task. Hence, such analysis can help improve classification efficiency and
scalability.
Data transformation and reduction
The data can be transformed by any of the following methods.
 Normalization: The data is transformed using normalization. Normalization involves
scaling all values for a given attribute to make them fall within a small specified
range. Normalization is used when the neural networks or the methods involving
measurements are used in the learning step.
 Generalization: The data can also be transformed by generalizing it to the higher
concept. For this purpose, we can use the concept hierarchies.

4.2 DECISION TREE INDUCTION


Decision tree induction is the learning of decision trees from class-labeled training
tuples.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf
node) denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A decision tree for “buys_computer”

Figure 4.1 Decision tree

It represents the concept buys computer, that is, it predicts whether a customer at a
company is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf
nodes are denoted by ovals.

Decision Tree is a supervised learning method used in data mining for classification
and regression methods. It is a tree that helps us in decision-making purposes.
The decision tree creates classification or regression models as a tree structure. It
separates a data set into smaller subsets, and at the same time, the decision tree is steadily
developed.
Decision trees can deal with both categorical and numerical data.
Characteristics of Decision tree
 Decision tree is a model that is both predictive and descriptive.
 A decision tree is a tree that displays relationships found in the training data.
 The training process that generates the tree is called induction.
 The decision tree technique is popular because the rules generated are easy to describe
and understand.
 The technique is fast unless the data is very large, and a variety of software is available.
 Normally, the complexity of a decision tree increases as the number of attributes
increases.
 The quality of training data usually plays an important role in determining the quality
of the decision tree.
Benefits of decision tree
The benefits of having a decision tree are as follows
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
 It can handle multidimensional data.
 Decision tree classifiers have good accuracy
Applications
Decision tree induction algorithms have been used for classification in many
application areas such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology.
Decision Tree Induction Algorithm
A machine researcher named J. Ross Quinlan in 1980 developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the
successor of ID3.


ID3, C4.5 and CART(Classification and Regression Trees) adopt a greedy approach.
In this algorithm, there is no backtracking; the trees are constructed in a top-down recursive
divide-and-conquer manner.

Generating a decision tree form training tuples of data partition D


Algorithm : Generate_decision_tree
Input:
 Data partition, D, which is a set of training tuples and their associated class labels.
 attribute_list, the set of candidate attributes.
 Attribute selection method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then
    attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
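As a minimal sketch (not the pseudocode above itself), the same kind of greedy, top-down induction can be exercised through scikit-learn's DecisionTreeClassifier; the toy feature encoding and class labels below are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age_group, income_level, student] encoded as small integers.
X = [[0, 2, 0], [1, 2, 0], [2, 1, 1], [2, 0, 1], [0, 1, 1], [1, 0, 0]]
y = ['no', 'no', 'yes', 'yes', 'yes', 'no']          # class label: buys_computer

tree = DecisionTreeClassifier(criterion='entropy')   # information-gain style splits
tree.fit(X, y)

print(export_text(tree, feature_names=['age', 'income', 'student']))
print(tree.predict([[2, 1, 0]]))                     # classify a new tuple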


Attribute Selection Measures


An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
Conceptually, the "best" splitting criterion is the one that most closely results in pure partitions, that is, partitions in which all tuples belong to the same class.
Attribute selection measures are also known as splitting rules because they determine
how the tuples at a given node are to be split.
The attribute selection measure provides a ranking for every attribute describing the given training tuples. The attribute having the best score for the measure is selected as the splitting attribute for the given tuples.
Three popular attribute selection measures are
 Information gain
 Gain ratio
 Gini index
Information gain
ID3 uses information gain as its attribute selection measure.
Information gain is used for deciding the best features/attributes, that is, those that provide the maximum information about a class. It follows the concept of entropy, aiming at reducing the level of entropy from the root node to the leaf nodes.
According to the value of information gain, we split the node and build the decision
tree.
Let node N hold the tuples of partition D. The attribute with the highest information gain is selected as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
The expected information needed to classify a tuple in D is given by
Info(D) = − Σi pi log2(pi)
where pi is the probability that a tuple in D belongs to class Ci.
Info(D) is also known as the entropy of D.


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data.


Entropy can be calculated as:


Entropy(S) = − P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = the set of samples
P(yes) = probability of yes
P(no) = probability of no
Example

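A small Python sketch of this calculation is given below; the class counts and the three-way split are invented for illustration.

from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(labels, partitions):
    # Gain = Info(D) - weighted sum of Info(Dj) over the partitions Dj
    total = len(labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(labels) - weighted

D = ['yes'] * 9 + ['no'] * 5                        # 9 positive, 5 negative tuples
split = [['yes'] * 2 + ['no'] * 3,                  # partitions produced by some attribute
         ['yes'] * 4,
         ['yes'] * 3 + ['no'] * 2]
print(round(entropy(D), 3))                         # about 0.94
print(round(information_gain(D, split), 3))         # about 0.247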
Gain ratio
Information gain is biased towards choosing attributes with a large number of values
as root nodes. It means it prefers the attribute with a large number of distinct values.
C4.5 uses Gain ratio which is a modification of Information gain that reduces its bias
and is usually the best option.
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
where
SplitInfo_A(D) = − Σj ( |Dj| / |D| ) log2( |Dj| / |D| )
Gain ratio overcomes the problem with information gain by taking into account the
number of branches that would result before making the split.
Gini index
The Gini index can be used in CART.
The Gini index measures the impurity of D, a data partition or collection of training tuples, as
Gini(D) = 1 − Σi pi²
where pi is the probability that a tuple in D belongs to class Ci.


An attribute with a low Gini index should be preferred to one with a high Gini index.


Example

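A minimal Python sketch computing the Gini index from class counts; the counts are invented for illustration.

def gini(class_counts):
    total = sum(class_counts)
    return 1 - sum((n / total) ** 2 for n in class_counts)

print(round(gini([9, 5]), 3))    # mixed partition: 0.459
print(gini([7, 7]))              # evenly mixed partition has the highest impurity: 0.5
print(gini([14, 0]))             # pure partition: 0.0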
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
Some branches of the tree may reflect anomalies due to noise or outliers in the
training samples. Such decision trees are a result of overfitting the training data and may
result in poor accuracy.
Tree pruning methods address the problem of overfitting the data.
Pruning
Pruning is a technique that removes some splits and the subtrees created by them.
Pruning is the procedure that decreases the size of a decision tree. It can decrease the risk of overfitting by limiting the size of the tree.
Pruning is a technique to make an overfitted decision tree simpler and more general.
There are a number of techniques for pruning a decision tree by removing some splits and the subtrees created by them.
Pruning methods use statistical measures to remove the least reliable branches, providing faster classification and improving the ability of the tree to correctly classify independent test data.
There are two approaches to tree pruning, they are,
 Prepruning
 Postpruning
Prepruning
In the prepruning approach, a tree is pruned by halting its construction early.
Upon halting, the node becomes a leaf; it may hold the most frequent class among the subset samples.
Postpruning
Postpruning removes branches (subtrees) from a "fully grown" tree.
The tree node is pruned by removing its branches.


4.3 BAYESIAN CLASSIFICATION


Bayesian classifiers are the statistical classifiers. Bayesian classifiers can predict class
membership probabilities such as the probability that a given tuple belongs to a particular
class.
Bayesian classification is based on Bayes' Theorem.
It uses the given values to train a model and then it uses this model to classify new
data
In Bayesian classification, we have a hypothesis that the given data belongs to a
particular class, and then we calculate the probability for the hypothesis to be true.
The approach requires only one scan of the whole data. Also, if at some stage additional training data become available, each training example can incrementally increase or decrease the probability that a hypothesis is correct.
Bayes’ theorem
The expression P(A) refers to the probability that event A will occur.
P(AB) stands for the probability that event A will happen, given that event B has
already happened or
P(AB) is the conditional probability of A based on the condition that B has already
happened.

Consider X to be an object to be classified then Bayes theorem may read as giving the
probability of it belonging to one of the classes C1, C2, C3 etc by calculating P(Ci  X).
Probabilities P(Ci  X) may be calculated as

 P(Ci  X) is the probability of the object X belonging to class Ci.


 P(X  Ci) is the probability of obtaining attribute values X if we know that it belongs
to class Ci.
 P(Ci) is the probability of any object belonging to class Ci without any other
information.
 P(X) is the probability of obtaining attribute value X whatever class the object
belongs to.


To compute P(X | Ci), we use a naïve approach that assumes all attributes of X are independent of one another (which is often not true); that is why it is called the Naïve Bayes model.
The beauty of the Bayesian approach is that the probability of the dependent attribute can be estimated by computing estimates of the probabilities of the independent attributes.
Example: Naïve Bayes Method

There are 10 samples and 3 classes


Credit risk class A = 3
Credit risk class B = 3
Credit risk class C = 4
Prior probability is obtained by dividing these frequencies by the total number in the
training data.
P(A) = 3/10 = 0.3
P(B) = 3/10 = 0.3
P(C) = 4/10 = 0.4
If the data presented is {Yes, No, Female, Yes, A} for the five attributes, we can compute the posterior probability for each class as
P(X | Ci) = P({Yes, No, Female, Yes, A} | Ci)
= P(Own Home = Yes | Ci) × P(Married = No | Ci) ×
P(Gender = Female | Ci) × P(Employed = Yes | Ci) ×
P(Credit Rating = A | Ci).


We compute P(X | Ci) P(Ci) for each of the three classes given P(A) = 0.3, P(B) = 0.3 and P(C) = 0.4, and these values are the basis for comparing the three classes.
To compute P(X | Ci) = P({Yes, No, Female, Yes, A} | Ci) for each class, we need probabilities for each of,
P(Own Home = Yes | Ci)
P(Married = No | Ci)
P(Gender = Female | Ci)
P(Employed = Yes | Ci)
P(Credit Rating = A | Ci)
We can order the data by risk class to make it convenient.
Naïve Bayes Method

Given the estimates of the probabilities, we can compute the posterior probabilities,
P(X | A) = 1/3 × 1 × 1 × 1 × 2/3 = 2/9
P(X | B) = 2/3 × 2/3 × 0 × 1/3 × 1/3 = 0
P(X | C) = 0.5 × 0 × 1 × 1 × 0.5 = 0
The values of P(X | Ci) P(Ci) are zero for classes B and C and 0.3 × 2/9 ≈ 0.067 for class A.
So X is assigned to Class A.
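The same posterior computation can be reproduced with a short Python sketch; the prior and conditional probability estimates are simply the values read off above.

priors = {'A': 0.3, 'B': 0.3, 'C': 0.4}

# P(attribute value | class) for X = {OwnHome=Yes, Married=No, Gender=Female,
# Employed=Yes, CreditRating=A}, in that order.
likelihoods = {
    'A': [1/3, 1.0, 1.0, 1.0, 2/3],
    'B': [2/3, 2/3, 0.0, 1/3, 1/3],
    'C': [0.5, 0.0, 1.0, 1.0, 0.5],
}

scores = {}
for c in priors:
    p_x_given_c = 1.0
    for p in likelihoods[c]:
        p_x_given_c *= p                   # naive independence assumption
    scores[c] = p_x_given_c * priors[c]    # proportional to P(Ci | X)

print(scores)                              # class A scores about 0.067, B and C score 0
print(max(scores, key=scores.get))         # 'A' - X is assigned to class A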


4.4 BACKPROPAGATION
Backpropagation is a neural network learning algorithm.
Neural network is a set of connected input/output units in which each connection has
a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input tuples.
Neural network learning is also referred to as connectionist learning due to the
connections between units.
Neural Network as a Classifier
Weakness
 Neural networks involve long training times
 They require a number of parameters that are typically best determined empirically, such as the network topology or "structure."
 Poor interpretability - It is difficult to interpret the symbolic meaning behind the learned weights and of the "hidden units" in the network.
Advantages of neural networks
 High tolerance of noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on a wide array of real-world data
 Neural network algorithms are inherently parallel
Several techniques have been recently developed for rule extraction from trained
neural networks. These factors contribute to the usefulness of neural networks for
classification and numeric prediction in data mining.
A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. It iteratively learns a set of weights for prediction of the class label of tuples.
Multilayer Feed-Forward Neural Network (MFFNN) is an interconnected Artificial
Neural Network with multiple layers that has neurons with weights associated with them and
they compute the result using activation functions.
It is one of the types of neural networks in which the flow of the network is from the input to the output units; it does not have any loops, there is no feedback, and no signal moves in the backward direction (from the output layer to the hidden and input layers).


A multilayer feed-forward neural network consists of an input layer, one or more


hidden layers, and an output layer.

Figure 4.2 Multilayer feed-forward neural network

In this network there are the following layers:


 Input Layer: The inputs to the network correspond to the attributes measured for
each training tuple. The inputs are fed simultaneously into the units making up the
input layer.
 Hidden Layer: This layer lies after the input layer. Inputs fed from input layer are
then weighted and fed simultaneously to a hidden layer.
 Output Layer: It is a layer that contains output units or neurons and receives
processed data from the hidden layer. The weighted outputs of the last hidden layer
are input to units making up the output layer, which emits the network's prediction.
Defining a Network Topology
Before training can begin, the user must decide on the network topology by specifying
the number of units in the input layer, the number of hidden layers (if more than one), the
number of units in each hidden layer, and the number of units in the output layer.
Normalizing the input values for each attribute measured in the training tuples will
help speed up the learning phase.
There are no clear rules as to the "best" number of hidden layer units. Network design
is a trial-and-error process and may affect the accuracy of the resulting trained network. The
initial values of the weights may also affect the resulting accuracy.


Once a network has been trained and its accuracy is not considered acceptable, it is
common to repeat the training process with a different network topology or a different set of
initial weights.
Backpropagation
Backpropagation work as follows
Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network‘s prediction for each tuple with the actual known target value.
For each training tuple, the weights are modified so as to minimize the mean-squared
error between the network‘s prediction and the actual target value. These modifications are
made in the "backwards" direction (i.e., from the output layer) through each hidden layer
down to the first hidden layer (hence the name backpropagation).
Algorithm: Backpropagation. Neural network learning for classification or numeric
prediction, using the backpropagation algorithm.
Input:
 D, a data set consisting of the training tuples and their associated target values;
 l, the learning rate;
 network, a multilayer feed-forward network.
Output: A trained neural network.
Method:


The training algorithm of backpropagation involves four stages which are as follows
 Initialization of weights − There are some small random values are assigned.
 Feed-forward − Each input unit X receives an input signal and transmits this signal to each of the hidden units Z1, Z2, ..., Zn. Each hidden unit calculates its activation function and sends its signal to each output unit. The output unit calculates the activation function to form the response for the given input pattern.
 Backpropagation of errors − Each output unit compares its activation Yk with the target value Tk to determine the associated error for that unit. Based on this error, an error factor is computed and used to distribute the error at output unit Yk back to all units in the previous layer. Similarly, an error factor is computed for each hidden unit Zj.
 Updating of weights and biases − Finally, the weights and biases are updated so as to reduce the error.
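The four stages can be sketched for a single hidden layer with NumPy as below; the network sizes, toy data, and learning rate are invented for illustration and are not part of the original algorithm listing.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))                              # 8 training tuples, 3 input attributes
T = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0      # toy target values

# Stage 1: initialization of weights with small random values
W1, b1 = rng.normal(0, 0.1, (3, 4)), np.zeros(4)    # input -> hidden
W2, b2 = rng.normal(0, 0.1, (4, 1)), np.zeros(1)    # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                            # learning rate l

for epoch in range(2000):
    # Stage 2: feed-forward through the hidden and output layers
    Z = sigmoid(X @ W1 + b1)
    Y = sigmoid(Z @ W2 + b2)

    # Stage 3: backpropagation of errors (output error, then hidden-layer error)
    err_out = (Y - T) * Y * (1 - Y)
    err_hid = (err_out @ W2.T) * Z * (1 - Z)

    # Stage 4: update the weights and biases (gradient descent on squared error)
    W2 -= lr * Z.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid
    b1 -= lr * err_hid.sum(axis=0)

print(np.round(Y.ravel(), 2))                       # network predictions after training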
Need for Backpropagation
Backpropagation is "backpropagation of errors" and is very useful for training
neural networks. It‘s fast, easy to implement, and simple. Backpropagation does not require
any parameters to be set, except the number of inputs. Backpropagation is a flexible method
because no prior knowledge of the network is required.
Advantages
 It is simple, fast, and easy to program.
 Only the number of inputs needs to be specified; no other parameters are tuned.
 It is flexible and efficient.
 No need for users to learn any special functions.
Disadvantages
 It is sensitive to noisy data and irregularities.
 Performance is highly dependent on the input data.
 Training can take a considerable amount of time.

4.5 CLASSIFICATION METHODS


Genetic Algorithms
Genetic Algorithm is based on an analogy to biological evolution.
In general, genetic learning starts as follows
An initial population is created consisting of randomly generated rules. Each rule is
represented by a string of bits.
Example: if A1 and ¬A2 then C2 can be encoded as 100. If an attribute has k > 2
values, k bits can be used

Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules and their offsprings.
The fitness of a rule is represented by its classification accuracy on a set of training
examples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In
mutation, randomly selected bits in a rule‘s string are inverted.
Genetic algorithms are slow but easily parallelizable and have been used for
classification as well as other optimization problems.
Rough Set Approach
Rough set theory can be used for classification to discover structural relationships
within imprecise or noisy data.
It applies to discrete-valued attributes. Continuous-valued attributes must therefore be
discretized before its use.
Rough set theory is based on the establishment of equivalence classes within the
given training data.
Rough sets can be used to approximately or "roughly" define such classes. A rough
set definition for a given class, C, is approximated by two sets—a lower approximation of C
and an upper approximation of C.
The lower approximation of C consists of all the data tuples that are certain to belong
to C without ambiguity. The upper approximation of C consists of all the tuples that cannot
be described as not belonging to C.
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph).
Attribute values are converted to fuzzy values.
Example: income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated.
For a given new sample, more than one fuzzy value may apply.
Each applicable rule contributes a vote for membership in the categories.
Logistic Regression
Logistic Regression is one of the classification techniques in Data Mining.


The statistical method of modeling a binomial outcome with one or more explanatory variables is known as logistic regression.
This algorithm attempts to detect whether a variable instance belongs to a specific
category.
Regressions are commonly utilized in applications such as:
 Credit Score
 Estimate revenue for a specific product.
K-Nearest Neighbor
The K-nearest neighbor approach is one of the common classification techniques in Data Mining; it relies on a distance (similarity) measure for classification.
To begin, we train the algorithm with a collection of data. Following that, the distance
between the training and new data is calculated to categorize the new data.
This approach can be computationally costly depending on the size of the training set.
The K-NN algorithm will use the complete data set to produce a prediction.
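A minimal sketch of K-nearest-neighbor classification with scikit-learn is shown below; the dataset and the value of k are chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # distance-based vote among 5 nearest neighbors
knn.fit(X_tr, y_tr)
print('accuracy:', knn.score(X_te, y_te))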

Support Vector Machine (SVM)


Support Vector Machine (SVM) is a supervised machine learning technique for classification, regression, and anomaly detection.
SVMs work by determining the optimum hyperplane for dividing a dataset into two classes.
The SVM method seeks to maximize the margin between the two object classes, on the premise that the greater the distance, the more reliable the classification.

4.6 PREDICTION
To find a numerical output, prediction is used. The training dataset contains the inputs
and numerical output values.
According to the training dataset, the algorithm generates a model or predictor. When
fresh data is provided, the model should find a numerical output.
This approach, unlike classification, does not have a class label. A continuous-valued
function or ordered value is predicted by the model.
In most cases, regression is utilized to make predictions. For example: Predicting the
worth of a home based on facts like the number of rooms, total area, and so on.


4.7 CLASSIFIERS ACCURACY


The accuracy of a classification method is the ability of the method to correctly determine the class of a randomly selected data instance.
It may be expressed as the probability of correctly classifying unseen data.
Estimating the accuracy of a supervised classification method can be difficult if only
the training data is available and all of that data has been used in building the model.
The accuracy estimation problem is much easier when much more data is available
than is required for training the model.
Different sets of training data would lead to different models, and testing is very important in determining how accurate each model is.
Accuracy may be measured using a number of metrics, these include
 Sensitivity
 Specificity
 Precision
 Accuracy
A number of methods for estimating the accuracy of a method, they are
 Holdout Method
 Random sub-sampling Method
 K-fold cross-validation Method
 Leave-one-out Method
 Bootstrap Method
Holdout Method
The holdout method is sometimes called the test sample method. It requires a training set and a test set.
Often only one dataset is available; it is then divided into two subsets (typically 2/3 for training and 1/3 for testing).
Once the classification method produces the model using the training set, the test set can be used to estimate the accuracy.
A larger test set would produce a better estimate of the accuracy.
Random sub-sampling Method
This method is much like the holdout method except that it does not rely on a single
test set.


Random sub-sampling is likely to produce better error estimates than those by the
holdout method.
K-fold cross-validation Method
In K-fold cross validation, the available data is randomly divided into K disjoint
subsets of approximately equal size.
One of the subsets is then used as the test set and the remaining K-1 sets are used for
building the classifier.
The test set is then used to estimate the accuracy.
Leave-one-out Method
Leave-one-out is a simpler version of K-fold cross-validation.
In this method, one of the training samples is taken out and the model is generated
using the remaining training data.
Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not.
This method is useful when the dataset is small.
It is unbiased but has high variance and is not reliable.
Bootstrap Method
A bootstrap sample is randomly selected uniformly with replacement by sampling n
times and used to build a model.
It can be shown that, on average, only 63.2% of the original objects appear in a bootstrap sample.
The error in building the model is estimated by using the remaining 36.8% of objects that are not in the bootstrap sample.
The bootstrap method is unbiased, has low variance but many iterations are needed
for good error estimates if the sample is small.
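As a minimal sketch, the holdout and K-fold cross-validation estimates described above can be obtained with scikit-learn as follows; the dataset and classifier are chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier()

# Holdout: 2/3 of the data for training, 1/3 held out as the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print('holdout accuracy:', clf.fit(X_tr, y_tr).score(X_te, y_te))

# K-fold cross-validation with K = 10 disjoint subsets
scores = cross_val_score(clf, X, y, cv=10)
print('10-fold mean accuracy:', scores.mean())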


5. Clustering Techniques
Clustering is the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity, but are very dissimilar to objects
in other clusters.
Dissimilarities and similarities are assessed based on the attribute values describing
the objects and often involve distance measures.
Clustering as a data mining tool has its roots in many application areas such as
biology, security, business intelligence, and Web search.

5.1 CLUSTER ANALYSIS


Define: Cluster Analysis
Cluster Analysis is the process of finding groups of similar objects, forming clusters in which the objects are similar to one another within the group but different from the objects in other groups.
It is an unsupervised machine learning-based algorithm that acts on unlabelled data.
A group of data points would comprise together to form a cluster in which all the
objects would belong to the same group.
Define: Cluster
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
The given data is divided into different groups by combining similar objects into a
group. This group is known as a cluster.
The group of similar objects is called a Cluster.
The task is to convert the unlabelled data to labelled data and it can be done using
clusters.
What Is Good Clustering?
 A good clustering method will produce high quality clusters with
 High intra-class similarity
 Low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the
method and its implementation.
 The quality of a clustering method is also measured by its ability to discover some or
all of the hidden patterns.


Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as
o Market research
o Pattern recognition
o Data analysis
o Image processing
 Clustering can also help marketers discover distinct groups in their customer base.
And they can characterize their customer groups based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures inherent
to populations.
 Clustering also helps in identification of areas of similar land use in an earth
observation database.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit
card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining
 Scalability − We need highly scalable clustering algorithms to deal with large
databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable to
be applied on any kind of data such as interval-based (numerical) data, categorical,
and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded only to distance measures that tend to find spherical clusters of small size.
 High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
 Constraint-based clustering − Real-world applications may need to perform clustering under various kinds of constraints. A challenging task is to find data groups with good clustering behavior that satisfy specified constraints.


 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

5.2 CLUSTERING METHODS


Clustering methods can be classified into the following categories
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Partition methods obtain a single level partition of objects.
A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= N, i.e., it classifies the data into k groups which together satisfy the following requirements,
 Each group must contain at least one object
 Each object must belong to exactly one group
A partitioning method creates an initial partitioning, and then it uses an iterative
relocation technique that attempts to improve the partitioning by moving objects from one
group to another.
Hierarchical Methods
Hierarchical Method obtains a nested partition of the object resulting in a tree of
clusters. This method creates a hierarchical decomposition of the given set of data objects.
We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed.
There are two approaches here
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach.


Hierarchical methods that start with each object in an individual cluster and then try to merge similar clusters into larger and larger clusters are called agglomerative.
Divisive Approach
This approach is also known as the top-down approach.
Hierarchical methods that start with one cluster and then split it into smaller and smaller clusters are called divisive.
Approaches to improve the quality of Hierarchical Clustering
Two approaches that are used to improve the quality of hierarchical clustering
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering
on the micro-clusters.
Density-based Method
Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms.
The data points in the low-density regions separating two clusters are considered as noise.
This method can deal with arbitrary shape clusters since the major requirement of
such method is that each cluster be a dense region of points surrounded by regions of low
density.
Density based methods are,
DBSCAN (Density Based Spatial Clustering of Application with Noise)
OPTICS (Ordering Points To identify the Clustering Structure)
LOF (Local Outlier Factors)
Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure.
Advantages
 The major advantage of this method is fast processing time.
 It is dependent only on the number of cells in each dimension in the quantized space.


Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for
a given model. This method locates the clusters by clustering the density function. It reflects
spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of desired
clustering results. Constraints can be specified by the user or the application requirement.

5.3 HIERARCHICAL METHODS


Hierarchical Methods in Data Mining
A hierarchical clustering method works by grouping data objects into a tree of
clusters.
In Hierarchical Clustering, the aim is to produce a hierarchical series of nested
clusters.
Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
Hierarchical clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters, where each cluster is distinct from the other clusters, and the objects within each cluster are similar to one another.
Dendrogram
It is a tree-structured diagram that graphically represents the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The


main use of a dendrogram is to work out the best way to allocate objects to clusters

Types of hierarchical clustering methods


 Agglomerative: the hierarchical decomposition is formed in a bottom-up (merging)
fashion.
 Divisive: the hierarchical decomposition is formed in a top-down (splitting) fashion.
Agglomerative hierarchical clustering
Agglomerative clustering is one of the most common types of hierarchical clustering
used to group similar objects in clusters.
Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point acts as an individual cluster and, at each step, data objects are grouped in a bottom-up manner.
Initially, each data object is in its own cluster. At each iteration, the clusters are combined with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all the other clusters (compute the proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for the newly formed cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
The graphical representation of this algorithm using a dendrogram
Let six data points A, B, C, D, E, and F


Figure – Agglomerative Hierarchical clustering
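A minimal sketch of agglomerative clustering and its dendrogram for six 2-D points labelled A to F, using SciPy; the coordinates are invented for illustration.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [(1, 1), (1.5, 1), (5, 5), (5.5, 5.2), (9, 1), (9.2, 1.3)]   # A..F
Z = linkage(points, method='single')      # single linkage = minimum-distance merging

dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E', 'F'])
plt.title('Agglomerative hierarchical clustering')
plt.show()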

Divisive Hierarchical Clustering


Divisive hierarchical clustering is exactly the opposite of Agglomerative Hierarchical
clustering.
This top-down strategy starts with all objects in one cluster. It subdivides the cluster
into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies
certain termination conditions,
Termination conditions can be
 a desired number of clusters is obtained or
 the diameter of each cluster is within a certain threshold.
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method.


Figure – Divisive Hierarchical clustering

Advantages of Hierarchical clustering


 It is simple to implement and gives the best output in some cases.
 It is easy and results in a hierarchy, a structure that contains more information.
 It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
 It breaks the large clusters.
 It is Difficult to handle different sized clusters and convex shapes.
 It is sensitive to noise and outliers.
 A merge or split, once performed, can never be undone.
Measures for Distance Between Clusters
Common measures for distance between clusters are as follows
 Minimum distance
 Maximum distance
 Mean distance
 Average distance
Minimum distance

When an algorithm uses the minimum distance, it is sometimes called a nearest-neighbor clustering algorithm.
The distance between two clusters is taken as the shortest distance between any pair of their members; such a method is also called a single-linkage algorithm.
Maximum distance

When an algorithm uses the maximum distance, it is sometimes called a farthest-neighbor clustering algorithm.
In this method, the distance between one cluster and another cluster is taken to be the greatest distance from any member of one cluster to any member of the other cluster; it is called a complete-linkage algorithm.
Mean distance


The minimum and maximum measures tend to be overly sensitive to outliers or noisy
data.
The use of mean or average distance is a compromise between the minimum and
maximum distances and overcomes the outlier sensitivity problem.
Average distance

Whereas the mean distance is the simplest to compute, the average distance is advantageous in that it can handle categoric as well as numeric data.
The computation of the mean vector for categoric data can be difficult or impossible to define.
In this method, the distance between one cluster and another cluster is taken to be the average distance from any member of one cluster to any member of the other cluster. It is called average linkage.
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a multiphase hierarchical clustering method using clustering feature trees. It is designed for clustering a large amount of numeric data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage).
It overcomes the two difficulties in agglomerative clustering methods:
 Scalability
 The inability to undo what was done in the previous step.
BIRCH uses the notions of clustering feature to summarize a cluster, and clustering
feature tree (CF-tree) to represent a cluster hierarchy.
These structures help the clustering method achieve good speed and scalability in
large or even streaming databases, and also make it effective for incremental and dynamic
clustering of incoming objects.
The ideas of clustering features and CF-trees have been applied beyond BIRCH.
 Clustering Feature (CF)


BIRCH summarizes large datasets into smaller, dense regions called Clustering
Feature (CF) entries.
The clustering feature (CF) of the cluster is a 3-D vector summarizing information
about clusters of objects.
Formally, a Clustering Feature entry is defined as an ordered triple,
CF = (N, LS, SS)
where 'N' is the number of data points in the cluster, 'LS' is the linear sum of the data points, and 'SS' is the squared sum of the data points in the cluster.
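A small sketch computing a CF entry for a tiny 2-D cluster; the points are invented for illustration.

import numpy as np

cluster = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(cluster)                  # number of data points
LS = cluster.sum(axis=0)          # linear sum of the data points
SS = (cluster ** 2).sum()         # squared sum of the data points

print(N, LS.tolist(), SS)         # 5, [16, 30], 244
print((LS / N).tolist())          # the CF is enough to derive the centroid: [3.2, 6.0]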
 CF Tree
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering.

Figure CF-tree structure


The primary phases are
 Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which can
be viewed as a multilevel compression of the data that tries to preserve the data‘s
inherent clustering structure.
 Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of
the CF-tree, which removes sparse clusters as outliers and groups dense clusters into
larger ones.
Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to
determine the similarity between pairs of clusters.
In Chameleon, cluster similarity is assessed based on
 how well connected objects are within a cluster and
 the proximity of clusters.
That is, two clusters are merged if their interconnectivity is high and they are close
together. Thus, Chameleon does not depend on a static, user-supplied model and can
automatically adapt to the internal characteristics of the clusters being merged.


The merge process facilitates the discovery of natural and homogeneous clusters and
applies to all data types as long as a similarity function can be specified.

Figure 10.10 Chameleon: hierarchical clustering based on k-nearest neighbors and dynamic
modeling.
Chameleon works as,
 Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph.
 Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor graph into a large number of relatively small subclusters, minimizing the edge cut, that is, the total weight of the edges that are cut.
 Chameleon then uses an agglomerative hierarchical clustering algorithm that
iteratively merges subclusters based on their similarity.
 To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity and the closeness of the clusters.
Specifically, Chameleon determines the similarity between each pair of clusters Ci
and Cj according to their relative interconnectivity, RI(Ci ,Cj), and their relative closeness,
RC(Ci ,Cj).
 The relative interconnectivity, RI(Ci ,Cj), between two clusters, Ci and Cj , is
defined as the absolute interconnectivity between Ci and Cj , normalized with respect
to the internal interconnectivity of the two clusters, Ci and Cj.
 The relative closeness, RC(Ci ,Cj), between a pair of clusters, Ci and Cj , is the
absolute closeness between Ci and Cj , normalized with respect to the internal
closeness of the two clusters, Ci and Cj .

Probabilistic Hierarchical Clustering


Algorithmic hierarchical clustering methods using linkage measures tend to be easy to
understand and are often efficient in clustering. They are commonly used in many clustering
analysis applications.


However, algorithmic hierarchical clustering methods can suffer from several drawbacks; that is, algorithmic hierarchical clustering is:
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal is not clear: heuristic, local search
Probabilistic hierarchical clustering aims to overcome some of these disadvantages by
using probabilistic models to measure distances between clusters.
Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered as a sample of the
underlying data generation mechanism to be analyzed.
 Easy to understand, same efficiency as algorithmic agglomerative clustering
method, can handle partially observed data.

5.4 DENSITY-BASED METHODS


Density-based clustering is a clustering method which can discover clusters of non-spherical shape.
Density-based clustering algorithms play a vital role in finding non-linearly shaped structures based on density.
The density-based methods are based on the assumption that clusters are high density
collections of data of arbitrary shape that are separated by a large space of low density data
(which is assumed to be noise).
Clustering is based on the number of data points within a region, that is, on the density.
5.4.1 DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is most
widely used density based algorithm for clustering.
The method was designed for spatial databases but can be used in other applications.

It requires two input parameters,


 Size of the neighbourhood (R)
 The minimum number of points in the neighbourhood (N)
These two parameters determine the density within the clusters the user is willing to
accept because they specify how many points must be in a region.


The size parameter R determines the size of the clusters found.


If R is big enough, there would be one big cluster and no outliers.
If R is small, there will be small dense clusters and there might be many
outliers.
It uses the concept of
 Density Reachability - A point "p" is said to be density reachable from a point "q" if point "p" is within ε distance from point "q" and "q" has a sufficient number of points in its neighborhood that are within distance ε.
 Density Connectivity - Points "p" and "q" are said to be density connected if there exists a point "r" which has a sufficient number of points in its neighborhood and both the points "p" and "q" are within the ε distance. This is a chaining process.

Concept
Concepts are required in the DBSCAN method,
 Neighbourhood
 Core Object
 Proximity
 Connectivity
 Neighbourhood
The neighbourhood of an object y is defined as all the objects that are within
the radius R from y.
 Core Object
An object y is called a core object if there are N objects within its
neighbourhood.
 Proximity
Two objects are defined to be in proximity to each other if they belong to the same cluster; that is, x1 is in proximity to object x2 when the objects are close enough to each other and x2 is a core object.
 Connectivity
Two objects x1 and xn are connected if there is path or chain of objects x1, x2,
…., xn from x1 to xn.

Basic Algorithm for density-based clustering


1. Select values of R and N.


2. Arbitrarily select an object P.


3. Retrieve all objects that are connected to P, given R and N.
4. If P is a core object, a cluster is formed.
5. If P is a border object and no objects are in its proximity, choose another object and go to step 3.
6. Continue the process until all of the objects have been processed.
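A minimal sketch of density-based clustering with scikit-learn's DBSCAN implementation is shown below; eps plays the role of the neighbourhood radius R and min_samples the role of N, and the data set and parameter values are invented for illustration.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))    # cluster ids; the label -1 marks objects treated as noise/outliers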

Major Features of Density-Based Clustering


 It requires only a single scan of the data.
 It requires density parameters as a termination condition.
 It is used to manage noise in data clusters.
 Density-based clustering is used to identify clusters of arbitrary size.
Advantages
 Does not require a-priori specification of number of clusters.
 Able to identify noise data while clustering.
 DBSCAN algorithm is able to find arbitrarily size and arbitrarily shaped clusters.
Disadvantages
 DBSCAN algorithm fails in case of varying density clusters.
 Fails in case of neck type of dataset.

5.4.2 OPTICS - A Cluster-Ordering Method


OPTICS stands for Ordering Points To Identify the Clustering Structure.
It produces an ordering of the database with respect to its density-based clustering structure.
The cluster ordering contains information equivalent to the density-based clusterings obtained over a wide range of parameter settings.
OPTICS methods are good for both automatic and interactive cluster analysis,
including determining an intrinsic clustering structure.
It can be represented graphically or using visualization techniques.


Figure Core-distance and reachability-distance

DENCLUE
DENCLUE is a density-based clustering method proposed by Hinneburg and Keim.
It enables a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and it is good for data sets with a huge amount of noise.
Major Features
 It has a solid mathematical foundation.
 It is definitely good for data sets with large amounts of noise.
 It allows a compact mathematical description of arbitrarily shaped clusters in high-
dimensional data sets.
 It is significantly faster than existing algorithms.
 However, it needs a large number of parameters.
DENCLUE - Technical Essence
It uses grid cells but only keeps information about grid cells that do actually contain
data points and manages these cells in a tree-based access structure.
Influence function: This describes the impact of a data point within its neighborhood.
The Overall density of the data space can be calculated as the sum of the influence function
of all data points.
The Clusters can be determined mathematically by identifying density attractors.
The Density attractors are local maxima of the overall density function.

5.5 OUTLIER ANALYSIS


An outlier is a data object that deviates significantly from the rest of the objects, as if
it were generated by a different mechanism. They can be caused by measurement or
execution errors.
The analysis of outlier data to identify the behavior of the outliers is referred to as
outlier analysis or outlier mining.
An outlier cannot be termed as a noise or error. Instead, they are suspected of not
being generated by the same method as the rest of the data objects.

Figure The objects in region R are outliers

Outlier detection is also related to novelty detection in evolving data sets. For
example, by monitoring a social media web site where new content is incoming, novelty
detection may identify new topics and trends in a timely manner.
Difference between outliers and noise
Noise is any unwanted error or random variance in a measured variable. Before finding the outliers present in any data set, it is recommended first to remove the noise.
Types of Outliers
In general, outliers can be classified into three categories, namely
 Global (or point) outliers
 Contextual (or conditional) outliers
 Collective outliers
Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers.
If, in a given dataset, a data point strongly deviates from all the rest of the data points,
it is known as a global outlier.
Global outliers are sometimes called point anomalies.
Mostly, all of the outlier detection methods are aimed at finding global outliers.


Figure The green data point is a global outlier


Applications
Global outlier detection is important in many applications.
 Intrusion detection in computer networks: If a large number of packets are broadcast in a very short span of time, then this may be considered as a global outlier and we can say that the particular system has been potentially hacked.
 In trading transaction auditing systems, transactions that do not follow the
regulations are considered as global outliers and should be held for further
examination.
Collective Outliers
In a given set of data, a group of data points that deviates significantly from the rest of the
data set is called a collective outlier.
Here, the individual data objects may not be outliers, but when seen as a whole, they
may behave as outliers. To detect these types of outliers, we might need background
information about the relationship between those data objects showing the behavior of
outliers.

Figure The black objects form a collective outlier


Contextual Outliers
In a given data set, a data object is a contextual outlier if it deviates significantly with
respect to a specific context of the object.


Contextual outliers are also known as conditional outliers because they are
conditional on the selected context. Therefore, in contextual outlier detection, the context has
to be specified as part of the problem definition.
Contextual outlier analysis provides flexibility for users where one can examine
outliers in different contexts, which can be highly desirable in many applications.
The attributes of a data object are divided into contextual attributes (which define the
object's context, such as date or location) and behavioral attributes (which define the
object's characteristics, such as temperature).

Figure A low temperature value in June is a contextual outlier because the same value in
December is not an outlier
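A hedged sketch of the same idea in code: the month is the contextual attribute and the temperature is the behavioral attribute, so a value is flagged only if it deviates strongly from the other readings in the same month. The readings, the threshold and the function name are hypothetical.

from collections import defaultdict

def contextual_outliers(readings, threshold=1.5):
    # readings: list of (month, temperature) pairs; compare each value only
    # with the other values observed in the same context (month)
    by_month = defaultdict(list)
    for month, temp in readings:
        by_month[month].append(temp)
    outliers = []
    for month, temp in readings:
        vals = by_month[month]
        if len(vals) < 2:
            continue
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        if std > 0 and abs(temp - mean) / std > threshold:
            outliers.append((month, temp))
    return outliers

readings = [("Jun", 30), ("Jun", 31), ("Jun", 29), ("Jun", 32), ("Jun", 5),
            ("Dec", 5), ("Dec", 4), ("Dec", 6)]
print(contextual_outliers(readings))   # -> [('Jun', 5)]; the same value is normal in Dec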

Applications of outlier detection


Applications where outlier detection plays a major role are,
 Fraud detection in the telecom industry
 In market analysis, outlier analysis enables marketers to identify customers with
unusual behavior.
 In the Medical analysis field.
 Fraud detection in banking and finance such as credit cards, insurance sector, etc.


6. Introduction to Advanced Topics

6.1 WEB MINING


Web mining is the application of data mining techniques to automatically discover and
extract information from web documents and services.
The main purpose of web mining is discovering useful information from the World-
Wide Web and its usage patterns.
It is normally expected that either the hyperlink structure of the web or web log
data or both have been used in the mining process.
Process of Web Mining

Figure web mining process


Classifications of web mining
Web mining can be broadly divided into three different types of techniques of mining:
Web Content Mining, Web Structure Mining and Web Usage Mining.


Web content mining


Web content mining is the application of discovering and extracting useful
information from the content of web documents (web pages).
Web content consists of several types of data such as text, images, audio, video, etc.
Content data is the collection of facts that a web page is designed to convey to its users.
It can provide effective and interesting patterns about user needs.
Mining text documents draws on text mining, machine learning and natural language
processing; for this reason, web content mining is also known as text mining.
This type of mining performs scanning and mining of the text, images and groups of
web pages according to the content of the input.
Algorithm
The algorithm proposed is called Dual Iterative Pattern Relation Extraction (DIPRE).
It works as follows,
1. Sample: Start with a sample S provided by the user.
2. Occurrences: Find occurrences of tuples starting with those in S. Once the tuples are
found, the context of every occurrence is saved. Let this set of occurrences, derived
from S, be O.
3. Patterns: Generate patterns based on the set of occurrences O. This requires
generating patterns with similar contexts; let the resulting set of patterns be P.
4. Match Patterns: The web is now searched for the patterns in P.
5. Stop if enough matches are found; otherwise, go to step 2.
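A toy, in-memory sketch of this loop is shown below. Real DIPRE operates on web pages gathered by a crawler; here a small list of strings stands in for the web, the sample contains one (author, title) tuple, and the generated patterns are crude regular expressions built from the saved contexts. All data, names and pattern shapes are hypothetical.

import re

corpus = [
    'Herman Melville wrote the novel "Moby Dick" in 1851.',
    'Leo Tolstoy wrote the novel "War and Peace" over many years.',
    'Jane Austen wrote the novel "Pride and Prejudice" anonymously.',
]

def find_occurrences(sample, corpus):
    # step 2: find occurrences of the sample tuples and save their context
    occurrences = []
    for author, title in sample:
        for doc in corpus:
            i, j = doc.find(author), doc.find(title)
            if i >= 0 and j > i:
                middle = doc[i + len(author):j]               # text between author and title
                suffix = doc[j + len(title):j + len(title) + 1]
                occurrences.append((middle, suffix))
    return occurrences

def generate_patterns(occurrences):
    # step 3: turn each distinct saved context into a (very crude) pattern
    return [re.compile(r'([A-Z][\w ]+?)' + re.escape(mid) +
                       r'([A-Z][\w ]+?)' + re.escape(suf))
            for mid, suf in set(occurrences)]

def match_patterns(patterns, corpus):
    # step 4: search the "web" (here, the corpus) with the patterns
    found = set()
    for pattern in patterns:
        for doc in corpus:
            found.update(pattern.findall(doc))
    return found

seed = {("Herman Melville", "Moby Dick")}
patterns = generate_patterns(find_occurrences(seed, corpus))
print(match_patterns(patterns, corpus) - seed)   # newly discovered (author, title) tuples

In the full algorithm, the newly found tuples would be added to the sample and the loop repeated (step 5) until enough matches have been collected.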

Web Structure Mining



Web structure mining is the application of discovering structure information from the
web that deals with discovering and modeling the link structure of the web.
The structure of the web graph consists of web pages as nodes, and hyperlinks as
edges connecting related pages.
Structure mining basically shows the structured summary of a particular website.
It identifies relationships between web pages linked by information or direct link
connection. To determine the connection between two commercial websites, Web structure
mining can be very useful.
The aim of web structure mining is to discover the link structure or the model in web.
The model may be based on the topology of the hyperlinks.
Link structure is only one kind of information that may be used in analyzing the
structure of the web.

This can help in


 Discovering similarity between sites
 Discovering authority sites for a particular topic
 Discovering overview or survey sites that point to many authority sites (such sites are called hubs)
HITS
Kleinberg (1999) has developed a connectivity analysis algorithm called Hyperlink-
Induced Topic Search (HITS) based on the assumption that links represent human
judgments.
The HITS algorithm is based on the idea that if the creator of page p provides a link to
page q, then p confers some authority on page q.
In searching, a very large number of items are retrieved; one way to order these items
is by the number of in-links.
The HITS algorithm has two major steps,
 Sampling step: It collects a set of relevant web pages for a given topic.
 Iterative step: It finds hubs and authorities using the information collected during
sampling.
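A minimal sketch of the iterative step (a simplified version, not Kleinberg's full implementation) is given below: hub and authority scores are repeatedly updated from each other and normalised on a small hypothetical link graph, where each page maps to the list of pages it links to.

def hits(graph, iterations=20):
    # graph: dict mapping each page to the list of pages it links to
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score = sum of the hub scores of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # hub score = sum of the authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # normalise so the scores do not grow without bound
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {                      # hypothetical result of the sampling step
    "p1": ["a1", "a2"],
    "p2": ["a1", "a2"],
    "p3": ["a2"],
    "a1": [],
    "a2": [],
}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))   # best authorities first: a2, a1, ...
print(sorted(hub, key=hub.get, reverse=True))     # best hubs first: p1, p2, then p3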

Web Usage Mining


Web usage mining is the application of identifying or discovering interesting usage
patterns from large sets of web log data. These patterns enable us to understand user behavior.

In web usage mining, users' access data on the web is collected in the form of logs. So,
web usage mining is also called log mining.
Other aims of web usage mining are to obtain information and discover usage patterns
that assist website design or redesign, perhaps to assist navigation through the site.
The information collected in the web server logs usually includes information about
the access, referrer and agent (about browser).
Web usage mining may also involve collecting more information using analysis tools;
the items typically collected are,
 Number of hits
 Number of visitors
 Visitor referring website
 Visitor referral website
 Entry point
 Visitor time and duration
 Path analysis
 Visitor IP address
 Browser type
 Platform
 Cookies
Log data analysis has been investigated using the following techniques,
 Using association rules
 Using composite association rules
 Using cluster analysis
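As a small illustration of how such an analysis might begin, the following Python sketch (assuming a simplified, common-log-format-like line layout; the sample lines are hypothetical) counts hits, distinct visitors by IP address, the most requested pages and each visitor's entry point.

from collections import Counter

log_lines = [
    '192.168.1.10 - - [10/Mar/2024:10:01:02] "GET /index.html HTTP/1.1" 200',
    '192.168.1.10 - - [10/Mar/2024:10:01:09] "GET /products.html HTTP/1.1" 200',
    '192.168.1.22 - - [10/Mar/2024:10:02:45] "GET /index.html HTTP/1.1" 200',
    '192.168.1.31 - - [10/Mar/2024:10:03:11] "GET /offers.html HTTP/1.1" 200',
]

def parse(line):
    # extract the visitor IP address and the requested page from one log line
    ip = line.split()[0]
    page = line.split('"')[1].split()[1]
    return ip, page

hits = len(log_lines)
visitors = {parse(line)[0] for line in log_lines}
page_counts = Counter(parse(line)[1] for line in log_lines)
entry_points = {}                                  # first page requested by each visitor
for line in log_lines:
    ip, page = parse(line)
    entry_points.setdefault(ip, page)

print("Number of hits:", hits)
print("Number of visitors:", len(visitors))
print("Most requested pages:", page_counts.most_common(2))
print("Entry points:", entry_points)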
Applications of Web Mining
 Web mining helps to improve the power of web search engines by classifying web
documents and identifying web pages.
 It is used for web searching, e.g., Google, Yahoo, etc.
 Web mining is used to predict user behavior.
 Web mining is very useful for a particular e-commerce website and e-service e.g.,
landing page optimization.

6.2 SPATIAL DATA MINING


Spatial data
Spatial means space.
The data that provides information about a specific geographical area or location is
known as Spatial Data. It provides information that helps identify the location of a
feature or a boundary on Earth.


Moreover, spatial data can be processed using GIS (Geographical Information System) or image processing packages.
Types of Spatial Data
The different types of Spatial Data are as follows
 Feature Data: Feature data follows the vector data model. It represents the entity of
the real world, i.e., roads, trees, buildings, etc. This information can be visually
represented in the form of a point, line, or polygon.
 Coverage Data: Coverage data follows the raster data model. Coverage Data contains
the mapping of continuous data in space and is represented as a range of values in a
satellite image, a digital surface model, aerial photographs, etc. The visual
representation of coverage data is in the form of a grid or triangulated irregular
network.
Spatial data must contain
 Latitude and longitude information.
 UTM easting or northing.
 Other coordinates that denote a point's location in space and help in identifying a
location.
The emergence of spatial data and extensive usage of spatial databases has led to
spatial knowledge discovery.
Spatial data mining
Spatial data mining refers to the process of extraction of knowledge, spatial
relationships and interesting patterns from a spatial database.
Spatial data mining can be understood as a process that determines some interesting and
potentially valuable patterns from spatial databases.
Several tools assist in extracting information from geospatial data.
General-purpose data mining tools are typically used to analyze scientific and engineering
data, astronomical data, multimedia data, genomic data, and web data.
Specific features of geographical data that prevent the use of general-purpose data
mining algorithms are,
 Spatial relationships among the variables
 Spatial structure of errors
 Observations that are not independent
 Spatial autocorrelation among the features


 Non-linear interaction in feature space


Spatial data mining tasks

Classification
Classification determines a set of rules which find the class of the specified object as
per its attributes.
Association rules
Association rules are determined from the data sets; they describe patterns that occur
frequently in the database.
Characteristic rules
Characteristic rules describe some parts of the data set.
Discriminate rules
Discriminate rules describe the differences between two parts of the database, such as
calculating the difference between two cities as per employment rate.
Challenges in spatial data mining
Challenges involved in spatial data mining include identifying patterns or finding
objects that are relevant in a large database field using GIS/GPS tools or similar systems.

Examples of application domains of spatial data mining


Domain            Spatial data mining application

Public Safety     Discovery of hotspot patterns from crime event maps

Epidemiology      Detection of disease outbreaks

Business          Market allocation to maximize stores' profits

Neuroscience      Discovering patterns of human brain activity from neuro-images

Climate Science   Finding positive or negative correlations between temperatures of distant places

6.3 TEMPORAL MINING


Temporal Data
Temporal means time.
Temporal Data is data that represents a state in time. It is basically data that is valid only
for a prescribed period of time. Data is collected at particular times to analyze weather
patterns, monitor traffic, study demographics, etc.
Temporal data is useful for analyzing the change that happens over a period of time.
This analysis is later used for identifying the potential cause of the changes and thus
coming up with solutions.
Temporal data is usually a series of primary data types, generally numerical values, recorded over time.
Temporal data mining
Temporal data mining refers to the process of extracting knowledge about the
occurrence of an event, whether it follows a random, cyclic, or seasonal variation, etc.
The main objectives of temporal data mining are,
 To analyze temporal data to find temporal patterns, unexpected trends, or hidden
relations in large sequential data, by utilizing a set of approaches from machine
learning, statistics, and database technologies.
 To find the temporal patterns, trends, and relations within the data and extract
meaningful information from the data to visualize how the data trend has changed
over the course of time.
Temporal data mining is composed of three major tasks, namely
 Description of temporal data
 Representation of similarity measures
 Mining services
Temporal Data Mining can include the exploitation of efficient techniques of data
storage, quick processing, and quick retrieval methods that have been advanced for
temporal databases.


Temporal Data Mining includes the processing of time-series data, and sequences of
data to determine and compute the values of the same attributes over multiple time points.
Temporal data mining tasks
Tasks of temporal data mining are
 Data Characterization and Comparison
 Cluster Analysis
 Classification
 Association rules
 Prediction and Trend Analysis
 Pattern Analysis
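As a small illustration of the trend-analysis task listed above, the following Python sketch (a hedged example; the monthly series and the window size are hypothetical) smooths a time series with a moving average to show how the values change over the course of time.

def moving_average(series, window=3):
    # average each run of `window` consecutive values to smooth out noise
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_sales = [100, 98, 105, 110, 120, 118, 130, 140, 138, 150, 160, 158]
trend = moving_average(monthly_sales)
print([round(v, 1) for v in trend])   # the smoothed values rise steadily over the year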

Difference between Spatial and Temporal Data Mining

Concept: Spatial data mining is the extraction of information and relationships from geographical data stored in a spatial database, whereas temporal data mining is the extraction of information from temporal data to identify the patterns in the data.
Base logic: Spatial data mining needs space information, whereas temporal data mining needs time information.
Type of data: Spatial data mining primarily deals with spatial data such as location and geo-referenced data, whereas temporal data mining primarily deals with implicit and explicit temporal content from a huge set of data.
Principle: Spatial data mining is based on rules such as association rules, discriminant rules, characteristic rules, etc., whereas temporal data mining is based on finding patterns in the data by clustering, association, prediction, and data comparison.
Examples: Finding hotspots and unusual locations (spatial), and understanding weather changes over a period of time (temporal).
