Data Mining

1.a) What is Data Mining? Discuss the major issues in Data mining.

Ans:- Data mining is one of the most useful techniques for extracting valuable
information from huge sets of data. Data mining is also called Knowledge Discovery in
Databases (KDD).

The process of extracting information to identify patterns, trends, and useful data
from huge sets of data is called Data Mining.

Major issues in Data Mining:

Data mining is not an easy task: the algorithms used can become very complex, and the data is
not always available in one place; it often has to be integrated from various heterogeneous data
sources. These factors give rise to the following issues:

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues

Mining Methodology and User Interaction Issues


 Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore, it is necessary for
data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because this allows users to focus
the search for patterns, providing and refining data mining requests based on
the returned results.
 Incorporation of background knowledge − Background knowledge can be
used to guide the discovery process and to express the discovered patterns,
not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks should
be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and visual
representations, and these representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required
to handle noise and incomplete objects while mining data regularities; without
such methods, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may be uninteresting if they
represent common knowledge or lack novelty, so interestingness measures are
needed to evaluate them.
Performance Issues
There can be performance-related issues such as the following −
 Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amount of data in databases,
data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, wide distribution of data, and complexity of
data mining methods motivate the development of parallel and distributed
data mining algorithms. These algorithms divide the data into partitions that
are processed in parallel, and the results from the partitions are then merged.
Incremental algorithms incorporate database updates without mining the
data again from scratch.

Diverse Data Types Issues


 Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data, temporal
data, and so on. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on a
LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Therefore, mining knowledge from them adds challenges to
data mining.

b) What are Data Mining functionalities? Explain the different types of
Data Mining functionality with examples.
Ans:- Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks.
Data mining deals with the kind of patterns that can be mined. On the basis of the
kind of data to be mined, there are two categories of functions involved in Data
Mining −

 Descriptive
 Classification and Prediction

Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions –
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and
printers, and concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept descriptions.
Mining of Association
Associations are used in retail sales to identify items that are frequently
purchased together.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.

Classification and Prediction


1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. It predicts the class of objects whose class label is unknown.

2. Prediction:
It is used to predict missing or unavailable numerical data values rather than class
labels.

3. Outlier Analysis:
Outliers may be defined as the data objects that do not comply with the general
behavior or model of the data available.

4. Evolution Analysis:
Evolution analysis refers to the description and modeling of regularities or trends for
objects whose behavior changes over time.

2. a) Draw the diagram and explain the architecture of Data Mining


system.
Ans: Data Mining Architecture
The significant components of data mining systems are a data source, data mining engine,
data warehouse server, the pattern evaluation module, graphical user interface, and
knowledge base.
Data Source:
The actual sources of data are the database, data warehouse, World Wide Web (WWW),
text files, and other documents. You need a huge amount of historical data for data
mining to be successful.

Different processes:

Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources
and in different formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate. Several methods may be
performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:


The database or data warehouse server consists of the original data that is ready to
be processed. Hence, the server is responsible for retrieving the relevant data, based
on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains
several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.

In other words, we can say the data mining engine is the core of the data mining architecture.

Pattern Evaluation Module:


The pattern evaluation module is primarily responsible for measuring how interesting
a discovered pattern is, typically by using a threshold value. It collaborates with the data
mining engine to focus the search on interesting patterns.

Graphical User Interface:


The graphical user interface (GUI) module communicates between the data mining
system and the user. This module helps the user to easily and efficiently use the
system without knowing the complexity of the process.

Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be
used to guide the search or to evaluate the interestingness of the resulting patterns. The data
mining engine may receive inputs from the knowledge base to make the results more
accurate and reliable.

b) What is Data warehouse? How is it different from an operational


database? Explain data marts.
A data warehouse is constructed by integrating data from multiple heterogeneous
sources.
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.

Operational Database vs. Data Warehouse

1. Operational systems are designed to support high-volume transaction processing; data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
2. Operational systems are usually concerned with current data; data warehousing systems are usually concerned with historical data.
3. Data within operational systems are mainly updated regularly according to need; a data warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
4. An operational database is designed for real-time business dealings and processes; a data warehouse is designed for analysis of business measures by subject area, categories, and attributes.
5. An operational database supports thousands of concurrent clients; a data warehouse supports a few concurrent clients relative to OLTP.
6. Operational systems are widely process-oriented; data warehousing systems are widely subject-oriented.
7. Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data; data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
8. An operational database is about "data in"; a data warehouse is about "data out".
9. An operational database accesses a small number of records; a data warehouse accesses a large number of records.
10. Relational databases are created for online transaction processing (OLTP); a data warehouse is designed for online analytical processing (OLAP).

DATA MART:
A data mart is a smaller form of a data warehouse that serves the specific data analysis
needs of a particular group. It is usually derived as a small part of the bigger data warehouse.
Reasons for creating a data mart
o Provides a collective view of data for a specific group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts


There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts


o Independent Data Marts

3. a) Discuss 3 tier data warehouse architecture and explain ROLAP,


MOLAP and HOLAP servers.
Answer- A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client computing
within the enterprise. Such applications gather detailed data from day-to-day operations.

1. Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems),
the reconciled layer, and the data warehouse layer (containing both data warehouses and
data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data
model for the whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population. In some cases, the
reconciled layer is also used directly to better accomplish some operational tasks, such as
producing daily reports that cannot be satisfactorily prepared using the corporate
applications, or generating data flows to feed external processes periodically in order to
benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra file storage space used by the redundant
reconciled layer. It also keeps the analytical tools a little further away from being
real-time.

(Figure: Data Warehouse Architecture)


Comparison of ROLAP, MOLAP, and HOLAP servers:

Storage location for summary aggregation: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a multidimensional database.

Processing time: Processing time of ROLAP is very slow; processing time of MOLAP is fast; processing time of HOLAP is fast.

Storage space requirement: Large in ROLAP compared to MOLAP and HOLAP; medium in MOLAP compared to ROLAP and HOLAP; small in HOLAP compared to MOLAP and ROLAP.

Storage location for detail data: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a relational database.

Latency: Low latency in ROLAP compared to MOLAP and HOLAP; high latency in MOLAP compared to ROLAP and HOLAP; medium latency in HOLAP compared to MOLAP and ROLAP.

Query response time: Slow query response time in ROLAP compared to MOLAP and HOLAP; fast query response time in MOLAP compared to ROLAP and HOLAP; medium query response time in HOLAP compared to MOLAP and ROLAP.

b) Explain the term OLAP. Discuss data cube technology in brief.


An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get an insight into the information through
fast, consistent, and interactive access to information.

Types of OLAP Servers


We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-
relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional
views of data. With multidimensional data stores, the storage utilization may be low
if the data set is sparse. Therefore, many MOLAP servers use two levels of data
storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow
storing large volumes of detailed information. The aggregations are stored
separately in a MOLAP store.

Specialized SQL Servers


Specialized SQL servers provide advanced query language and query processing
support for SQL queries over star and snowflake schemas in a read-only
environment.

How OLAP Works?


Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries
that are typically very hard to execute over tabular databases, namely aggregation,
joining, and grouping. These queries are calculated during a process that is usually
called 'building' or 'processing' of the OLAP cube. This process happens overnight,
and by the time end users get to work, the data will have been updated.

Data Cube
When data is grouped or combined into multidimensional matrices, the result is called a
Data Cube. The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."

For example, a relation with the schema sales (part, supplier, customer, sale-price)
can be materialized into a set of eight views as shown in the figure,
where psc indicates a view consisting of aggregate function values (such as total
sales) computed by grouping the three attributes part, supplier, and
customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, and so on.
A data cube is created from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
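
To make the idea concrete, here is a minimal pandas sketch (the sales rows below are invented for illustration) that materializes all eight group-by views of the part/supplier/customer cube:

from itertools import combinations
import pandas as pd

# Hypothetical sales relation: sales(part, supplier, customer, sale_price)
sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dimensions = ["part", "supplier", "customer"]

# Materialize all 2^3 = 8 views: psc, ps, pc, sc, p, s, c, and the apex (grand total).
for k in range(len(dimensions), -1, -1):
    for dims in combinations(dimensions, k):
        if dims:
            view = sales.groupby(list(dims))["sale_price"].sum()
        else:
            view = sales["sale_price"].sum()  # apex cuboid: total sales over all dimensions
        print("view", dims if dims else "ALL")
        print(view)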

4. a) Write in details about the key components of data


warehouse.

Components or Building Blocks of Data Warehouse


Architecture is the proper arrangement of the elements. We build a data warehouse
with software and hardware components arranged to suit our circumstances.

The figure shows the essential elements of a typical warehouse. The Source
Data component is shown on the left. The Data Staging element serves as the next
building block. In the middle, we see the Data Storage component that handles the
data warehouse's data. This element not only stores and manages the data; it also
keeps track of the data using the metadata repository. The Information Delivery
component, shown on the right, consists of all the different ways of making the
information from the data warehouse available to the users.

Source Data Component


Source data coming into the data warehouses may be grouped into four broad
categories:

Production Data: This type of data comes from the different operational systems of
the enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.

Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part
of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.

External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external sources.

Data Staging Component


After we have extracted data from various operational systems and external
sources, we have to prepare the files for storage in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and
made ready in a format suitable for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

Data Storage Component: Data storage for the data warehouse

is a separate repository. The data repositories for the operational systems generally
include only the current data. Also, these data repositories contain the data
structured in a highly normalized form for fast and efficient processing.

Information Delivery Component


The information delivery element is used to enable the process of subscribing to
data warehouse files and having them transferred to one or more destinations according
to some customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is equal to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.

Data Marts
Data marts are smaller than data warehouses and usually contain data for a single
part of the organization. The current trend in data warehousing is to develop a data
warehouse with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component


The management and control elements coordinate the services and functions within
the data warehouse. These components control the data transformation and the data
transfer into the data warehouse storage.
b) What is data transformation? Explain the different methods of data
transformation.

DATA TRANSFORMATION IN DATA MINING: Data transformation in

data mining is accomplished using a combination of structured and unstructured data. The data is
transferred to a cloud data warehouse and arranged homogeneously to make it easier to
recognize patterns. Here are the methods involved:

Smoothing:

Smoothing is a process used to remove the unnecessary, corrupt, or meaningless data, or
'noise', in a dataset. Smoothing improves the algorithm's ability to detect useful patterns in
data.
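
As a small illustration (a minimal pandas sketch; the readings are invented), noise can be smoothed with a simple moving average:

import pandas as pd

# Hypothetical noisy readings
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Smooth each value by averaging it with its neighbours (window of 3)
smoothed = values.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed.round(2).tolist())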

Aggregation:

Data aggregation is gathering data from a number of sources and storing it in a single format.
Aggregation, in itself, improves the quality of the data: it helps gather information about
groups of records and summarize large amounts of data.

Discretization:

Discretization is one of the transformation methods that break up continuous data into small
intervals. Although the data itself may be continuous, many existing frameworks can only
handle discrete chunks of data.

Attribute construction:

In attribute construction, new attributes are generated and applied in the mining process from
the existing set of attributes. It improves mining efficiency by simplifying the original data.

Generalization:

Generalization is used to convert low-level data attributes to high-level data attributes by the
use of a concept hierarchy. For example, age values in numerical raw form (22, 52) are
converted into categorical values (young, old).

Normalization:

Normalization is an important step in data transformation and is also called pre-processing.

Here the data is transformed so that it falls within a given range.
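
A common choice is min-max normalization, which rescales each value into [0, 1]. A minimal sketch (the income values are invented for illustration):

import pandas as pd

# Hypothetical attribute to normalize
income = pd.Series([12000, 35000, 58000, 73600, 98000])

# Min-max normalization: v' = (v - min) / (max - min) maps every value into [0, 1]
normalized = (income - income.min()) / (income.max() - income.min())
print(normalized.round(3).tolist())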

5. a) Explain the data discretization and concept hierarchy


generation
Discretization in data mining
Data discretization refers to a method of converting a huge number of data values
into smaller ones so that the evaluation and management of data become easy. In
other words, data discretization is a method of converting the values of a continuous
attribute into a finite set of intervals with minimum data loss. There are two
forms of data discretization: the first is supervised discretization, and the second is
unsupervised discretization. Supervised discretization refers to a method in which the
class information is used. Unsupervised discretization refers to a method that does not
use class information and depends only on the way in which the operation proceeds.

Now, we can understand this concept with the help of an example

Suppose we have an attribute of Age with the given values

Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77

Table before Discretization


Attribute               Age              Age                       Age                       Age

Before Discretization   1, 5, 4, 9, 7    11, 14, 17, 13, 18, 19    31, 33, 36, 42, 44, 46    70, 74, 77, 78

After Discretization    Child            Young                     Mature                    Old

Another example is analytics, where we gather statistics about website visitors. For
example, all visitors who visit the site with an IP address from India are shown under
the country level "India".
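
The age discretization shown above can be reproduced with simple interval binning (a minimal pandas sketch; the bin boundaries are chosen to match the table):

import pandas as pd

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

# Break the continuous Age attribute into labelled intervals matching the table above
labels = pd.cut(ages,
                bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(list(zip(ages, labels)))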

Data discretization and concept hierarchy


generation
The term hierarchy represents an organizational structure or mapping in which items
are ranked according to their levels of importance. In other words, we can say that a
concept hierarchy refers to a sequence of mappings from a set of specific, low-level
concepts to more general, higher-level concepts. It means mapping is done from
low-level concepts to high-level concepts. For example, in computer science, there are
different types of hierarchical systems.
There are two types of mapping in a hierarchy: top-down mapping and bottom-up
mapping.

A particular city can map with the belonging country. For example, New Delhi can be
mapped to India, and India can be mapped to Asia.

Top-down mapping

Top-down mapping generally starts at the top with general information and
ends at the bottom with specialized information.

Bottom-up mapping

Bottom-up mapping generally starts at the bottom with specialized
information and ends at the top with generalized information.
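
A concept hierarchy of this kind can be represented as a simple lower-level-to-higher-level mapping (a minimal sketch; the city and country values are illustrative):

# Bottom-up mapping: city -> country -> continent (values are illustrative)
city_to_country = {"New Delhi": "India", "Tokyo": "Japan", "Paris": "France"}
country_to_continent = {"India": "Asia", "Japan": "Asia", "France": "Europe"}

def generalize(city):
    """Climb the concept hierarchy from a low-level city to its higher-level concepts."""
    country = city_to_country[city]
    continent = country_to_continent[country]
    return [city, country, continent]

print(generalize("New Delhi"))  # ['New Delhi', 'India', 'Asia']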

b) How can we apply data reduction techniques? Explain.


6. a) Explain in detail how we generate association rules from frequent
itemsets with the help of an example.

Transactional Data for an AllElectronics Branch

TID List of item IDs

T100 I1, I2, I5

T200 I2, I4

T300 I2, I3

T400 I1, I2, I4

T500 I1, I3

T600 I2, I3

T700 I1, I3

T800 I1, I2, I3, I5

T900 I1, I2, I3

Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence). This can be done using
Eq. (6.4) for confidence, which we show again here for completeness:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

The conditional probability is expressed in terms of itemset support count, where
support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and
support_count(A) is the number of transactions containing the itemset A. Based on this
equation, association rules can be generated as follows:

For each frequent itemset l, generate all nonempty subsets of l.

For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if
support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum
confidence threshold.

Because the rules are generated from frequent itemsets, each one automatically satisfies
the minimum support. Frequent itemsets can be stored ahead of time in hash tables
along with their counts so that they can be accessed quickly.

Example 6.4 Generating association rules. Let's try an example based on the transactional
data for AllElectronics shown before in Table 6.1. The data contain the frequent itemset
X = {I1, I2, I5}. What are the association rules that can be generated from X? The nonempty
subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association
rules are as shown below, each listed with its confidence:

{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules are output, because these are the only ones generated that are strong. Note
that, unlike conventional classification rules, association rules can contain more than
one conjunct on the right side of the rule.
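
The rule-generation procedure above can be written directly in code. The following minimal sketch uses the nine AllElectronics transactions from the table and prints the strong rules generated from X = {I1, I2, I5} at a 70% minimum confidence (the function names are illustrative, not from the original text):

from itertools import combinations

# Transactions from the AllElectronics table (T100 .. T900)
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def generate_rules(l, min_conf):
    """For a frequent itemset l, output every rule s => (l - s) with confidence >= min_conf."""
    l = frozenset(l)
    rules = []
    for size in range(1, len(l)):
        for s in combinations(l, size):
            s = frozenset(s)
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

for lhs, rhs, conf in generate_rules({"I1", "I2", "I5"}, min_conf=0.70):
    print(f"{sorted(lhs)} => {sorted(rhs)}  (confidence = {conf:.0%})")

Running this prints the three strong rules from Example 6.4: {I1, I5} ⇒ I2, {I2, I5} ⇒ I1, and I5 ⇒ {I1, I2}, each with 100% confidence.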

b) Describe in detail the Apriori algorithm. List various ways used to


improve the efficiency of Apriori algorithm.

Apriori Algorithm
The Apriori algorithm is the algorithm used to calculate the association rules
between objects, i.e., how two or more objects are related to one another. In other words,
we can say that the Apriori algorithm is an association rule learning method that analyzes whether
people who bought product A also bought product B. The Apriori algorithm helps customers buy
their products with ease and increases the sales performance of the particular store.

Components of Apriori algorithm—

The given three components comprise the apriori algorithm.

 Support

 Confidence

 Lift

Support-- Support refers to the default popularity of any product. You find the support by
dividing the number of transactions containing that product by the total
number of transactions. Hence, we get Support (Biscuits) = (Transactions containing biscuits) /
(Total transactions) = 400/4000 = 10 percent.

Confidence-- Confidence refers to the possibility that the customers bought both biscuits
and chocolates together. To get the confidence, you divide the number of transactions that
contain both biscuits and chocolates by the number of transactions that contain biscuits. Hence,
Confidence = (Transactions containing both biscuits and chocolates) / (Total transactions
involving biscuits) = 200/400 = 50 percent. It means that 50 percent of customers who
bought biscuits bought chocolates as well.

 Lift-- Considering the above example, lift refers to the increase in the ratio of the sale of
chocolates when you sell biscuits. The mathematical equation of lift is given below. Lift =
(Confidence (Biscuits -> chocolates)) / (Support (Biscuits)) = 50/10 = 5. It means that the
probability of people buying both biscuits and chocolates together is five times more than that
of purchasing the biscuits alone. If the lift value is below one, it indicates that people are
unlikely to buy both items together; the larger the value, the better the combination.
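
Using the numbers from the example above (4,000 total transactions, 400 containing biscuits, and 200 containing both biscuits and chocolates), the three measures can be computed as follows (a minimal sketch):

total_transactions = 4000
biscuit_transactions = 400          # transactions containing biscuits
both_transactions = 200             # transactions containing biscuits and chocolates

support_biscuits = both = None  # placeholders replaced just below
support_biscuits = biscuit_transactions / total_transactions   # 0.10 -> 10%
confidence = both_transactions / biscuit_transactions          # 0.50 -> 50%
lift = confidence / support_biscuits                           # 5.0

print(f"support = {support_biscuits:.0%}, confidence = {confidence:.0%}, lift = {lift}")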

Advantages of Apriori Algorithm—

 It is used to calculate large itemsets.

 Simple to understand and apply.

Disadvantages of Apriori Algorithms—

 Apriori algorithm is an expensive method to find support since the calculation has to pass
through the whole database.

 Sometimes, you need a huge number of candidate rules, so it becomes computationally


more expensive.
List of techniques to improve the efficiency of the Apriori algorithm:

 Hash-based technique

 Transaction reduction

 Partitioning

 Sampling

 Dynamic itemset counting

Hash-based technique (hashing itemsets into corresponding buckets): A

hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.

Transaction reduction (reducing the number of transactions scanned in


future iterations): A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets.

Partitioning (partitioning the data to find candidate itemsets): A partitioning

technique can be used that requires just two database scans to mine the frequent itemsets.

Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for frequent
itemsets in S instead of D.

Dynamic itemset counting (adding candidate itemsets at different points


during a scan): A dynamic itemset counting technique was proposed in which the
database is partitioned into blocks marked by start points.
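
For completeness, here is a compact sketch of the level-wise Apriori search itself (join, prune using the Apriori property, then count support against the database); the demo transactions and the 50% minimum support are illustrative assumptions:

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: L1 -> C2 -> L2 -> ... until no frequent itemsets remain."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({itemset: support(itemset) for itemset in current})
        k += 1
        # Join step: build size-k candidates from the frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count step: keep only candidates meeting the minimum support
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Illustrative usage
demo = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}, {"bread", "milk", "butter"}]
for itemset, sup in sorted(apriori(demo, min_support=0.5).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), f"support = {sup:.0%}")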

7. a) Discuss the detailed concept of mining multi-dimensional association


rules from relational databases.
Mining Multidimensional Association Rules from Relational Databases and Data
Warehouses

A single-dimensional or intradimensional association rule contains a single distinct predicate
(e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the rule.

buys(X, “digital camera”)=>buys(X, “HP printer”)


Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.
age(X, “20…29”)^occupation(X, “student”)=>buys(X, “laptop”)
The above rule contains three predicates (age, occupation, and buys), each of which occurs only
once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called interdimensional
association rules.
We can also mine multidimensional association rules with repeated predicates, which contain
multiple occurrences of some predicates. These rules are called hybrid-dimensional
association rules. An example of such a rule is the following, where the predicate buys is
repeated:
age(X, “20…29”)^buys(X, “laptop”)=>buys(X, “HP printer”)
Techniques for mining multidimensional association rules can be categorized into
two basic approaches regarding the treatment of quantitative attributes.

In the first approach, quantitative attributes are discretized using predefined


concept hierarchies. This discretization occurs before mining.

In the second approach, quantitative attributes are discretized or clustered into


“bins” based on the data distribution.

b) What is the difference between clustering and prediction? List out some
of the issues regarding clustering and prediction.

8. a) Define Clustering. Also mention the various requirements and


applications of clustering
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.

What is Clustering?
Clustering is the process of making a group of abstract objects into classes of
similar objects.
Points to Remember
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different
groups.

Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer
base. And they can characterize their customer groups based on the
purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
 Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in
a city according to house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information
discovery.
 Clustering is also used in outlier detection applications such as detection of
credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with large
databases.
 Ability to deal with different kinds of attributes − Algorithms should be
capable of being applied to any kind of data such as interval-based (numerical)
data, categorical data, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. They should not
be bounded to only distance measures that tend to find spherical clusters of
small size.
 High dimensionality − The clustering algorithm should not only be able to
handle low-dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead to
poor quality clusters.
 Interpretability − The clustering results should be interpretable,
comprehensible, and usable.

b) Write short notes. i) K-means clustering ii) K-medoids


algorithms

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the


clustering problems in machine learning or data science.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on.

It allows us to cluster the data into different groups and a convenient way to discover
the categories of groups in the unlabeled dataset on its own without the need for
any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The


main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until the best clusters are found.
The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
Hence each cluster has data points with some commonalities and is far away from
other clusters.

The two steps alternate until the cluster assignments stop changing, as illustrated in the sketch below.
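
A minimal NumPy sketch of these two alternating steps (the sample points and K = 2 are illustrative assumptions, not from the original text):

import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k random points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative 2-D data with two loose groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)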

K-Medoids clustering

K-Medoids (also called as Partitioning Around Medoid) algorithm was


proposed in 1987 by Kaufman and Rousseeuw. A medoid can be defined as
the point in the cluster whose total dissimilarity to all the other points in the
cluster is minimum.
The dissimilarity of the medoid (Ci) and an object (Pi) is calculated by using E =
|Pi - Ci|

The cost in the K-Medoids algorithm is given as the sum of the dissimilarities of all
objects to their nearest medoid, i.e., Cost = Σ |Pi - Ci|.


Algorithm:
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid by using any common
distance metric.
3. While the cost decreases:
For each medoid m and for each data point o which is not a medoid:
1. Swap m and o, associate each data point with the closest
medoid, and recompute the cost.
2. If the total cost is more than that in the previous step, undo the
swap.
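
A minimal sketch of this swap-based procedure, assuming Manhattan distance as the dissimilarity measure (consistent with E = |Pi - Ci|); the sample points are illustrative:

import numpy as np

def total_cost(points, medoid_idx):
    """Sum of Manhattan dissimilarities from each point to its closest medoid."""
    dists = np.abs(points[:, None, :] - points[medoid_idx][None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def k_medoids(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k random points as the initial medoids
    medoids = list(rng.choice(len(points), size=k, replace=False))
    cost = total_cost(points, medoids)
    improved = True
    while improved:                       # Step 3: repeat while the cost decreases
        improved = False
        for i in range(len(medoids)):
            for o in range(len(points)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o          # swap medoid i with non-medoid point o
                new_cost = total_cost(points, candidate)
                if new_cost < cost:       # keep the swap only if the cost decreases
                    medoids, cost = candidate, new_cost
                    improved = True
    # Step 2: associate each point with its closest medoid
    dists = np.abs(points[:, None, :] - points[medoids][None, :, :]).sum(axis=2)
    return dists.argmin(axis=1), [points[m] for m in medoids]

points = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                   [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
labels, medoids = k_medoids(points, k=2)
print(labels, medoids)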
