CH 2

The document discusses the motivation and need for data mining due to the large amounts of data being collected. It describes data warehousing and data mining as a solution to analyze this data. It then provides an overview of key concepts in data mining, including the goals of data mining, common tasks like classification and clustering, and the multi-disciplinary nature of data mining.

Uploaded by

gauravkhunt110


Unit 3

Motivation:
“Necessity is the Mother of Invention”
 Data Explosion Problem
 Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
 We are drowning in data, but starving for knowledge

 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why data mining?
 Commercial point of view
 Data has become the key competitive advantage of companies
 Examples: Facebook, Google, Amazon
 Being able to extract useful information out of the data is key for
exploiting them commercially.
 Scientific point of view
 Scientists are in an unprecedented position: they can now collect
terabytes (TB) of data
 Examples: Sensor data, astronomy data, social network data, gene data
 We need the tools to analyze such data to get a better understanding of
the world and advance science
 Scale (in data size and feature dimension)
 Why not use traditional analytic methods?
 Enormity of data, curse of dimensionality
 The amount and complexity of the data do not allow for manual
processing. We need automated techniques.
What is Data Mining?
 Data mining (knowledge discovery in databases):
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns from data in large databases

 Alternative names and their “inside stories”:

 Data mining: a misnomer?
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
Knowledge Discovery in Databases (KDD)
KDD steps (CISTM: Cleaning, Integration, Selection, Transformation, Mining)
 Data cleaning: also known as data cleansing, it is a phase in which
noisy and irrelevant data are removed from the collection.
 Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
 Data selection: at this step, the data relevant to the analysis is decided
on and retrieved from the data collection.
 Data transformation: also known as data consolidation, it is a phase
in which the selected data is transformed into forms appropriate for the
mining procedure.
 Data mining: the crucial step in which clever techniques are
applied to extract potentially useful patterns.
 Pattern evaluation: in this step, strictly interesting patterns
representing knowledge are identified based on given measures.
 Knowledge representation: the final phase, in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the
data mining results.
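The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources, the record fields, and every function name are hypothetical, and the "mining" step is a plain frequency count rather than a real mining algorithm.

```python
from collections import Counter

# Two hypothetical sources with inconsistent formatting and a missing value.
store_a = [{"customer": "ann", "item": "Computer"}, {"customer": "bob", "item": None}]
store_b = [{"customer": "cy", "item": "computer "}, {"customer": "dee", "item": "Printer"}]

def clean(records):
    """Data cleaning: drop records whose item is missing."""
    return [r for r in records if r["item"]]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the analysis."""
    return [r["item"] for r in records]

def transform(items):
    """Data transformation: normalise values into one consistent form."""
    return [item.strip().lower() for item in items]

def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness measure."""
    return {item: n for item, n in counts.items() if n >= min_count}

# Steps in the order listed above (each source is cleaned, then integrated).
data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))), min_count=2)

# Knowledge representation: present the result to the user.
for item, n in patterns.items():
    print(f"{item}: bought {n} times")
```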
Data Mining: Classification Schemes
 Decisions in data mining
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted

Classification Criteria in Data Mining
 Databases to be mined
 Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy,
WWW, etc.
 Knowledge to be mined
 Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining shown at the center, drawing on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]
Data Mining Tasks
 Prediction Tasks
 Use some variables to predict unknown or future values of other
variables.
 Description Tasks
 Characterize the general properties of the data in the database.
Common data mining tasks
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation/Anomaly Detection [Predictive]
Data mining functionalities
 Data characterization: Data characterization is a
summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified
class are typically collected by a database query.

 For example, one may wish to characterize the customers of a
store who regularly rent more than 30 movies a year. With a data
cube containing summarization of data, simple OLAP
operations fit the purpose of data characterization.
Data mining functionalities
 Discrimination: Data discrimination produces what are called
discriminant rules and is basically the comparison of the general
features of objects between two classes referred to as the target class
and the contrasting class.

 For example, one may wish to compare the general characteristics of
the customers who rented more than 30 movies in the last year with
those whose rental count is lower than 5.
 The techniques used for data discrimination are similar to the
techniques used for data characterization, with the exception
that data discrimination results include comparative measures.
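A minimal sketch of the movie-rental example, assuming a hypothetical list of customer records (rentals, age, monthly spend) that does not come from the slides. `summarize` plays the role of characterization, and the difference between the two summaries stands in for the comparative measures that discrimination adds.

```python
from statistics import mean

# Hypothetical customer records: (rentals_last_year, age, spent_per_month).
customers = [
    (45, 24, 30.0),
    (38, 31, 26.0),
    (52, 27, 34.0),
    (3, 52, 4.0),
    (1, 60, 2.0),
    (4, 47, 6.0),
    (12, 35, 10.0),
]

def summarize(group):
    """Characterization: summarise the general features of one class."""
    return {
        "count": len(group),
        "mean_age": mean(age for _, age, _ in group),
        "mean_spend": mean(spend for _, _, spend in group),
    }

target = [c for c in customers if c[0] > 30]       # rented more than 30 movies
contrasting = [c for c in customers if c[0] < 5]   # rented fewer than 5 movies

target_summary = summarize(target)
contrast_summary = summarize(contrasting)

# Discrimination: the same summaries plus a comparative measure per feature.
comparison = {
    key: target_summary[key] - contrast_summary[key]
    for key in ("mean_age", "mean_spend")
}

print(target_summary)
print(contrast_summary)
print(comparison)
```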
Data mining functionalities
Mining Frequent Patterns, Associations, and
Correlations
 Frequent patterns, as the name suggests, are patterns that occur
frequently in data. There are many kinds of frequent patterns,
including:
 itemsets,
 subsequences,
 substructures.
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
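To make "frequent itemsets" concrete, the sketch below counts every itemset of size one or two in a few hypothetical transactions and keeps those that occur at least `min_count` times. It is plain enumeration for illustration, not an optimised algorithm such as Apriori, and the transactions and function name are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def frequent_itemsets(transactions, min_count, max_size=2):
    """Count every itemset up to max_size and keep the frequent ones."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

frequent = frequent_itemsets(transactions, min_count=2)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```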
Data mining functionalities
 Association analysis: Association analysis studies the
frequency of items occurring together in transactional databases
and, based on a threshold called support, identifies the frequent
itemsets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction
when another item appears, is used to pinpoint association rules.
 Association analysis is widely used for market basket or
transaction data analysis.
Data mining functionalities
 An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
 where X is a variable representing a customer. A confidence, or
certainty, of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all of the transactions under analysis
showed that computer and software were purchased together.
 This association rule involves a single attribute or predicate (i.e.,
buys) that repeats. Association rules that contain a single
predicate are referred to as single-dimensional association rules.
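Support and confidence for a rule such as computer => software can be computed directly from their definitions. The sketch below uses five hypothetical transactions (the slides do not list the AllElectronics data), so the resulting percentages differ from the 1% and 50% quoted above.

```python
# Hypothetical transactions; the source does not list the AllElectronics data.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here two of the five transactions contain both items, and two of the four computer purchases also include software.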
Data mining functionalities
 A data mining system may find association rules like
age(X, “20...29”) ∧ income(X, “20K...29K”) => buys(X, “CD player”)
[support = 2%, confidence = 60%]
 Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys). If each attribute is referred to as a
dimension, the above rule can be referred to as a multidimensional
association rule.
 Typically, association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum confidence
threshold. Additional analysis can be performed to uncover interesting
statistical correlations between associated attribute-value pairs.
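The threshold test for a multidimensional rule can be sketched as follows. The records, the discretised attribute values, and the helper names (`matches`, `rule_measures`, `is_interesting`) are all hypothetical; the point is only that a rule is kept when it meets both minimums.

```python
# Hypothetical customer records; attribute values are already discretised.
records = [
    {"age": "20..29", "income": "20K..29K", "buys": "CD player"},
    {"age": "20..29", "income": "20K..29K", "buys": "CD player"},
    {"age": "20..29", "income": "20K..29K", "buys": "radio"},
    {"age": "30..39", "income": "30K..39K", "buys": "CD player"},
    {"age": "30..39", "income": "20K..29K", "buys": "radio"},
]

def matches(record, conditions):
    """True if the record has every attribute-value pair in conditions."""
    return all(record.get(attr) == value for attr, value in conditions.items())

def rule_measures(antecedent, consequent, records):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records if matches(r, {**antecedent, **consequent}))
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

def is_interesting(antecedent, consequent, records, min_support, min_confidence):
    """Keep a rule only if it satisfies both minimum thresholds."""
    support, confidence = rule_measures(antecedent, consequent, records)
    return support >= min_support and confidence >= min_confidence

antecedent = {"age": "20..29", "income": "20K..29K"}
consequent = {"buys": "CD player"}
print(rule_measures(antecedent, consequent, records))
print(is_interesting(antecedent, consequent, records, 0.3, 0.6))
print(is_interesting(antecedent, consequent, records, 0.3, 0.7))
```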
Data mining functionalities
 Classification : Classification is the process of finding a model (or
function) that describes and distinguishes data classes or concepts, for
the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose class label
is known).

 A classification model can be represented in various forms, such as:
 IF-THEN rules
 A decision tree
 Neural network
 Support Vector Machine (SVM)
 Bayesian classification
Classification vs. Prediction
 Classification predicts categorical class labels (discrete or
nominal). It constructs a model from the training set and the
values (class labels) of a classifying attribute, then uses that
model to classify new data.
 Prediction models continuous-valued functions, i.e., predicts
unknown or missing values.

 Typical applications
 Credit/loan approval
 Target Marketing
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
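In contrast to the categorical labels above, prediction produces a continuous value. As a minimal sketch, the code below fits a straight line by ordinary least squares and predicts a value for an unseen input. The data (years of experience against salary) and the function name `fit_line` are hypothetical, and a straight line is only one of many possible continuous-valued models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept   # continuous value for an unseen input
print(predicted)
```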
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes. Each
tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute. The set of tuples used for model
construction is the training set. The model is represented as
classification rules, decision trees, or mathematical formulae.
 Model usage: classifying future or unknown objects.
 Estimate the accuracy of the model: the known label of each test
sample is compared with the classified result from the model. The
accuracy rate is the percentage of test set samples that are correctly
classified by the model.
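The two steps and the accuracy estimate can be sketched with a very small classifier. The example below swaps in a 1-nearest-neighbour model, which is not one of the representations named in these slides, simply because it fits in a few lines. The loan data and all names are hypothetical.

```python
# Hypothetical labelled samples: ((income_k, debt_k), class_label).
training_set = [
    ((60, 5), "approve"), ((75, 10), "approve"), ((80, 2), "approve"),
    ((20, 15), "reject"), ((25, 30), "reject"), ((30, 25), "reject"),
]
test_set = [
    ((70, 8), "approve"), ((22, 20), "reject"),
    ((65, 4), "approve"), ((50, 28), "approve"),
]

def build_model(training_set):
    """Step 1, model construction: a nearest-neighbour model just stores the data."""
    return list(training_set)

def classify(model, sample):
    """Step 2, model usage: return the label of the closest training sample."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(model, key=lambda pair: squared_distance(pair[0], sample))
    return label

def accuracy(model, test_set):
    """Fraction of test samples whose predicted label equals the known label."""
    correct = sum(1 for features, label in test_set
                  if classify(model, features) == label)
    return correct / len(test_set)

model = build_model(training_set)
print(classify(model, (72, 6)))
print(f"accuracy = {accuracy(model, test_set):.0%}")
```

The test samples are kept separate from the training set, so the accuracy rate reflects how the model handles samples it has not seen.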
Clustering
 Clustering: Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters.
 Unlike classification, in clustering the class labels are unknown
and it is up to the clustering algorithm to discover acceptable classes.
 Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels.
 There are many clustering approaches, all based on the principle of
maximizing the similarity between objects in the same class (intra-class
similarity) and minimizing the similarity between objects of different
classes (inter-class similarity).
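As one concrete instance of this principle, the sketch below runs a basic k-means loop on one-dimensional data. The data values, the starting centroids, and the function name `k_means_1d` are assumptions chosen to keep the example deterministic; real implementations usually pick starting centroids at random and work on multi-dimensional points.

```python
def k_means_1d(values, centroids, iterations=10):
    """Basic k-means on numbers, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters

# Hypothetical unlabelled data: daily price changes of six stocks, in percent.
changes = [0.9, 1.1, 1.0, 7.8, 8.1, 8.3]
centroids, clusters = k_means_1d(changes, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No class labels are supplied; the two groups emerge only from how close the values are to each other.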
Clustering
Applications of Cluster Analysis
 Understanding: group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks with similar
price fluctuations
Clustering Algorithms
 K-means and its variants
 Hierarchical clustering
 Density-based clustering
Supervised vs. Unsupervised Learning
 Supervised learning (classification):
 The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering):
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Data mining functionalities
 Outlier analysis: Outliers are data elements that cannot be grouped in
a given class or cluster. Also known as exceptions or surprises, they are
often very important to identify. While outliers can be considered noise
and discarded in some applications, they can reveal important
knowledge in other domains, and thus can be very significant and their
analysis valuable.

 Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same
account.

 Outlier values may also be detected with respect to the location and
type of purchase, or the purchase frequency.
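The credit card example can be sketched with a simple rule: flag any charge that is far larger than the typical charge on the same account. The median-based test, the factor of 10, the account data, and the function name are all assumptions for illustration, not a method given in the slides.

```python
from statistics import median

# Hypothetical charges per account number.
charges = {
    "acct-1": [25.0, 40.0, 31.5, 28.0, 2400.0, 35.0],
    "acct-2": [900.0, 1100.0, 950.0, 1000.0],
}

def unusual_charges(amounts, factor=10.0):
    """Flag charges more than `factor` times the account's median charge."""
    typical = median(amounts)
    return [a for a in amounts if a > factor * typical]

flagged = {account: unusual_charges(amounts) for account, amounts in charges.items()}
print(flagged)
```

Comparing each charge with its own account's history matters here: a charge of 1000 is routine for the second account but would stand out on the first.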
Data Mining Architecture
 Data mining is a very important process where potentially useful
and previously unknown information is extracted from large
volumes of data.
 There are a number of components involved in the data mining
process. These components constitute the architecture of a data
mining system.
 The major components of any data mining system are data
source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface and
knowledge base
Data Mining Architecture
 Data Sources
 Database, data warehouse, World Wide Web (WWW), text files and
other documents are the actual sources of data. You need large
volumes of historical data for data mining to be successful.
 Different Processes
 The data needs to be cleaned, integrated and selected before
passing it to the database or data warehouse server.
 Database or Data Warehouse Server
 The database or data warehouse server contains the actual data that
is ready to be processed. Hence, the server is responsible for
retrieving the relevant data based on the data mining request of the
user.
Data Mining Architecture
 Data Mining Engine
 It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering,
prediction, time-series analysis etc.
 Pattern Evaluation Modules
 The pattern evaluation module is mainly responsible for the measure of
interestingness of the pattern by using a threshold value. It interacts
with the data mining engine to focus the search towards interesting
patterns.
 Graphical User Interface
 The graphical user interface module communicates between the user
and the data mining system. This module helps the user use the system
easily and efficiently without knowing the real complexity behind the
process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily
understandable manner.
Data Mining Architecture
 Knowledge Base
 The knowledge base is helpful in the whole data mining process. It
might be useful for guiding the search or evaluating the
interestingness of the result patterns. The knowledge base might
even contain user beliefs and data from user experiences that can be
useful in the process of data mining. The data mining engine might
get inputs from the knowledge base to make the result more
accurate and reliable.
Data Mining Issues
 Major issues in data mining fall into five groups:
 Mining methodology
 User interaction
 Efficiency and scalability
 Diversity of data types
 Data mining and society
Data Mining Issues
 Mining methodology
 Mining different kinds of knowledge in databases
 Mining knowledge at multiple levels of abstraction
 Handling noisy or incomplete data
 Pattern evaluation
 User interaction
 Interactive mining
 Incorporation of background knowledge
 Query languages and ad hoc mining
 Presentation and visualization of data mining results
Data Mining Issues
 Efficiency and scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, and incremental mining algorithms
 Diversity of data types
 Handling of relational and complex types of data
 Mining information from heterogeneous databases and global
information systems
 Data mining and society
 Social impact of data mining
 Privacy preserving in data mining
 Invisible data mining
Data Mining Applications
Data mining is widely used in the following areas:
 Healthcare and Insurance
 Measuring Treatment Effectiveness – This application of data mining
involves comparing and contrasting symptoms, causes and courses of
treatment to find the most effective course of action for a certain illness or
condition. For example, patient groups who are treated with different drug
regimens can be compared to determine which treatment plans work best
and save the most money.

 Detecting Fraud and Abuse – This involves establishing normal patterns,
then identifying unusual patterns of medical claims by clinics, physicians,
labs, or others. This application can also be used to identify inappropriate
referrals or prescriptions and insurance fraud and fraudulent medical
claims. The Texas Medicaid Fraud and Abuse Detection System is a good
example of a business using data mining to detect fraud.
Data Mining Applications
 Education
 Data mining in education is concerned with developing methods that
discover knowledge from data originating from educational environments.
 Its goals include predicting students’ future learning behavior and
studying the effects of educational support. An institution can use data
mining to make accurate decisions and to predict student results. With
these results, the institution can focus on what to teach and how to
teach it.

 Retail Industry
Data mining in the retail industry helps identify customer buying patterns
and trends that lead to improved quality of customer service and good
customer retention and satisfaction.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.
 Market basket analysis
Data Mining Applications
 Banking/Finance
Financial data in the banking and financial industry is generally reliable and
of high quality, which facilitates systematic data analysis and data mining.
Some of the typical cases are as follows −
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.
 Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus
to anomaly detection. It helps an analyst distinguish unusual activity from
common everyday network activity.
 Monitoring and analyzing traffic
 Identifying abnormal activity
 Other applications are:
 Bioinformatics
 Crime agencies
 Scientific Applications
