1
Mrs. Dipali Meher
Modern College of Arts, Science and Commerce,
Ganeshkhind, Pune 411016
Data Mining: An Introduction
2
From Then till Now...
Bayes' Theorem (1763)
Regression (1805)
Turing (1936)
Neural Networks (1943)
Evolutionary Computation (1965)
Databases (1970)
Genetic Algorithms (1975)
KDD (1989)
Support Vector Machine (1992)
Data Science (2001)
Moneyball (2003)
Big Data
3
DBMS
RDBMS
Distributed DBMS
Data Mining
4
Data mining deals with the discovery of hidden knowledge, unexpected patterns,
and new rules from large data sets.
5
Examples of information extracted using a query language
• List customers who used a credit card to purchase groceries worth more than Rs. 10000
• List patients who have had at least one heart attack
• List students who have had at least one backlog
• List employees who have taken home loans
6
Examples of what data mining is used for
• Develop a general profile of credit card customers
• Determine patients whose lifestyle makes them prone to a heart attack in the near future
• Differentiate poor credit risk customers from good credit risk customers
• Differentiate students who have backlogs in their academics
• Determine employees who have taken a loan for any purpose
Data mining differs from usual query processing in many ways

Query
  Query processing: well formed, as Select… From… Where…
  Data mining: not well formed; what is found is usually hidden
Data
  Query processing: data from online transaction processing systems, generally in table format
  Data mining: data integrated from various sources; huge amounts of data
Output
  Query processing: a subset of the database
  Data mining: not only a subset, but analyzed results expressed as patterns
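To make the query-processing side of this comparison concrete, here is a minimal sketch of a well-formed query using Python's built-in sqlite3 module. The table name `purchases`, its columns, and the sample rows are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

def customers_over_threshold(rows, threshold=10000):
    """Return customers whose credit card grocery purchases exceed the threshold."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE purchases "
        "(customer TEXT, payment_mode TEXT, category TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)
    # A well-formed query: the user states exactly what is wanted.
    cursor = conn.execute(
        "SELECT customer FROM purchases "
        "WHERE payment_mode = 'credit_card' AND category = 'groceries' "
        "GROUP BY customer HAVING SUM(amount) > ? "
        "ORDER BY customer",
        (threshold,),
    )
    result = [row[0] for row in cursor.fetchall()]
    conn.close()
    return result

# Hypothetical sample data
sample_rows = [
    ("Asha", "credit_card", "groceries", 6000),
    ("Asha", "credit_card", "groceries", 5000),
    ("Ravi", "cash", "groceries", 20000),
    ("Meena", "credit_card", "groceries", 9000),
]
print(customers_over_threshold(sample_rows))
```

The output is a plain subset of the stored data. A data mining task, by contrast, would start without such a precise question.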
7
8
• Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
• Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
9
Knowledge discovery in databases (KDD) is a multistep process of finding useful
information and patterns in data, while data mining is the one step in KDD that
uses algorithms to extract patterns.
Steps of KDD
1. Selection
Data extraction: obtaining data from heterogeneous data sources such as
databases, data warehouses, the World Wide Web, or other information repositories.
2. Preprocessing
Data cleaning: incomplete, noisy, and inconsistent data must be cleaned.
Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
10
3. Transformation
Data integration: combines data from multiple sources into a coherent store.
Data can be encoded in common formats, normalized, and reduced.
4. Data mining
Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation
Pattern evaluation: evaluate the interestingness of the resulting patterns, or
apply interestingness measures to filter the discovered patterns.
Knowledge presentation: present the mined knowledge; visualization techniques
can be used.
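The five KDD steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the record fields (`segment`, `amount`), the cleaning rule, the "mining" step (an average per segment), and the interestingness threshold are all hypothetical choices, not part of the slides.

```python
def select(sources):
    # 1. Selection: gather records from several (hypothetical) sources
    return [record for source in sources for record in source]

def preprocess(records):
    # 2. Preprocessing: here missing or negative amounts are simply ignored
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def transform(records):
    # 3. Transformation: min-max normalize amounts into the range [0, 1]
    low = min(r["amount"] for r in records)
    high = max(r["amount"] for r in records)
    span = (high - low) or 1.0  # avoid division by zero when all values match
    return [{**r, "amount_norm": (r["amount"] - low) / span} for r in records]

def mine(records):
    # 4. Data mining: a toy "pattern", the mean normalized amount per segment
    totals = {}
    for r in records:
        running_sum, count = totals.get(r["segment"], (0.0, 0))
        totals[r["segment"]] = (running_sum + r["amount_norm"], count + 1)
    return {seg: s / n for seg, (s, n) in totals.items()}

def evaluate(patterns, min_interest=0.5):
    # 5. Evaluation: keep only patterns above an assumed interestingness threshold
    return {seg: value for seg, value in patterns.items() if value >= min_interest}

def kdd(sources):
    return evaluate(mine(transform(preprocess(select(sources)))))

# Hypothetical data from two sources
source_a = [{"segment": "A", "amount": 100}, {"segment": "B", "amount": None}]
source_b = [
    {"segment": "A", "amount": 80},
    {"segment": "B", "amount": 0},
    {"segment": "B", "amount": 20},
]
print(kdd([source_a, source_b]))
```

Each function corresponds to one step, so a real project could replace any stage (for example, the mining step) without touching the others.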
11
Knowledge Discovery Process
KDD is the nontrivial extraction of implicit, previously unknown, and
potentially useful knowledge from data.
Selection -> Preprocessing -> Transformation -> Data Mining -> Pattern Interpretation and Evaluation
12
13
[Figure: Graph 1, a chart of series a to g with data labels 27, 40, 34, 54, 24, 25, 29 on an axis from 0 to 60]
Graphical: bar charts, pie charts, histograms
Icon-based: using colors and figures as icons
14
Hierarchical: hierarchically dividing the display area
Geometric: box plot, scatter plot
15
Pixel-based: data shown as colored pixels
Hybrid: a combination of the above approaches
Why Data Mining? Potential Applications
• Data analysis and decision support
  - Market analysis and management: target marketing, customer relationship
    management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved
    underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
• Other applications
  - Text mining (newsgroups, email, documents) and Web mining
  - Stream data mining
  - Bioinformatics and bio-data analysis
16
Data mining algorithms: all algorithms attempt to fit a model that is closest
to the data being examined.
The model is based on the analysis of the attributes of a training data set.
The model is then evaluated using a test data set.
A data model can be
• Predictive: makes predictions about data values using the results found from
available data. It thus uses historical data to make predictions.
• Descriptive: identifies patterns or relationships in data. It finds the
properties of existing data and does not predict new properties.
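The train-then-test idea described above can be sketched as follows. The model here is a deliberately simple one-attribute threshold rule, chosen only for illustration; the data (study hours and a pass/fail label), the 75/25 split, and all function names are assumptions.

```python
def split(data, train_fraction=0.75):
    # Keep the first part for training and the rest for testing
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def fit_threshold(train):
    # "Model": the midpoint between the mean value of class 0 and of class 1
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def accuracy(threshold, test):
    # Fraction of test records whose predicted label matches the true label
    correct = sum(1 for x, label in test if predict(threshold, x) == label)
    return correct / len(test)

# Hypothetical records: (study hours, passed exam?)
data = [(1, 0), (2, 0), (8, 1), (9, 1), (3, 0), (7, 1), (2.5, 0), (8.5, 1)]
train, test = split(data)
model = fit_threshold(train)
print(model, accuracy(model, test))
```

Evaluating on records the model has not seen is what distinguishes a test set from the training set.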
17
Data Mining
  Predictive: Classification, Regression, Time Series Analysis, Prediction
  Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
18
Classification: maps data into predefined groups or classes.
It uses supervised learning.
The algorithm uses a learning phase to build a classifier from a training
data set containing data attributes and their associated class labels.
Example: the result of a student, that is, which class the student's result will fall into.
Pattern recognition is a type of classification in which an input pattern is
classified into one of several classes based on its similarity to the predefined
classes.
Example: identifying terrorists among passengers. Passengers are described by
basic patterns such as the distance between the eyes, size, and shape. These
patterns are then compared with entries in a database to see whether
any match is found.
19
20
21
Grade   Useful Heat Value (kcal/kg)
A       > 6200
B       5601 - 6200
C       4941 - 5600
D       4201 - 4940
E       3361 - 4200
F       2401 - 3360
G       1301 - 2400
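The grade table above is an instance of mapping a value into predefined classes. A minimal sketch of that mapping is shown below; note that here the class boundaries are given by the table rather than learned from training data. The function name and the decision to return `None` for values below the lowest band are assumptions.

```python
# Lower bounds (exclusive) taken from the table above, highest grade first
GRADE_LOWER_BOUNDS = [
    ("A", 6200),
    ("B", 5600),
    ("C", 4940),
    ("D", 4200),
    ("E", 3360),
    ("F", 2400),
    ("G", 1300),
]

def grade_for(useful_heat_value):
    """Map a useful heat value (kcal/kg) to its predefined grade."""
    for grade, lower_bound in GRADE_LOWER_BOUNDS:
        if useful_heat_value > lower_bound:
            return grade
    return None  # assumption: values below the lowest band are ungraded

for value in (6500, 6200, 5000, 1301, 1000):
    print(value, grade_for(value))
```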
22
Regression: maps data into a real-valued prediction variable.
The algorithm tries to find the best function (linear or non-linear) that fits
the training data. It assumes that the target data fits some function.
Example: a college professor plans his retirement based on current savings and
income. If the professor wants to save more, he must alter his expenses, guided
by a simple linear regression formula.
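Simple linear regression fits a line y = slope * x + intercept by least squares. The sketch below computes both coefficients directly from the standard formulas; the savings figures are made-up sample data, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    return slope * x + intercept

# Hypothetical data: savings (in lakhs) at the end of years 1 to 4
years = [1, 2, 3, 4]
savings = [3, 5, 7, 9]
slope, intercept = fit_line(years, savings)
print(slope, intercept, predict(slope, intercept, 5))
```

With this sample data the fitted line passes exactly through every point; real data would leave residual error.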
23
Time series analysis: the value of an attribute is examined as it varies over
time.
It can be used to determine similarities, classify behavior, or predict future
values.
Example: the share market
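One very simple way to examine an attribute over time is a moving average, which smooths the series and can serve as a naive forecast of the next value. This is a sketch of one technique among many, not the method the slides prescribe; the prices are invented.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def forecast_next(values, window):
    # Naive forecast: the mean of the most recent `window` values
    return sum(values[-window:]) / window

# Hypothetical daily closing prices of a share
prices = [10, 12, 14, 16]
print(moving_average(prices, 2))
print(forecast_next(prices, 2))
```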
Prediction: predicts future values using regression, time series analysis, or
other approaches.
Examples
• Flood prediction for a river based on water level, rainfall amount, time, and
humidity. Sensors placed at different locations along the river monitor
conditions so that floods can be predicted.
• Weather analysis
• Pollution analysis
24
25
Clustering: finding similarities between data according to the
characteristics found in the data, and grouping similar data
objects into clusters.
Unsupervised learning: there are no predefined classes.
Interpretability and usability: results should be comprehensible
and usable, so a domain expert is required.
Example
Students are clustered on various attributes such as academic
performance, the area in which they live, age, height, weight, body
mass index, and extracurricular activities.
Clusters do not have a specific size or shape.
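As an illustration of grouping without predefined classes, here is a minimal one-dimensional k-means sketch. K-means is just one clustering technique, swapped in here for concreteness; the slides do not name a specific algorithm. The marks and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers (1-D k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged, assignments are stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical attribute: one score per student
scores = [1, 2, 3, 10, 11, 12]
final_centers, groups = kmeans_1d(scores, centers=[0, 5])
print(final_centers, groups)
```

No labels were supplied; the two groups emerge only from the similarity of the values.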
26
Outlier
27
Summarization: maps data into subsets with simple descriptions. It extracts or
derives representative summary information.
Example
A summary of student results, which gives the number of students who appeared
for the exam, passed, and failed, broken down by class.
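The student-result summary can be sketched with a counter. The class names and mark boundaries below (40, 50, 60, 70) are hypothetical and chosen only to make the example runnable.

```python
from collections import Counter

def result_class(mark):
    # Hypothetical class boundaries, for illustration only
    if mark >= 70:
        return "distinction"
    if mark >= 60:
        return "first class"
    if mark >= 50:
        return "second class"
    if mark >= 40:
        return "pass class"
    return "fail"

def summarize(marks):
    """Reduce a list of marks to a short descriptive summary."""
    by_class = Counter(result_class(m) for m in marks)
    failed = by_class.pop("fail", 0)
    return {
        "appeared": len(marks),
        "passed": len(marks) - failed,
        "failed": failed,
        "by_class": dict(by_class),
    }

summary = summarize([35, 45, 55, 65, 75, 82])
print(summary)
```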
Association rules: discover relationships among data. They are used in
market basket analysis to find items frequently purchased together.
Example: a person buying sugar in a mall also buys milk. Items that
people buy together are usually kept together.
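Two standard measures for a rule such as "sugar implies milk" are support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). A minimal sketch with made-up baskets:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    matching = sum(1 for t in transactions if items <= set(t))
    return matching / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical market baskets
baskets = [
    {"sugar", "milk"},
    {"sugar", "milk", "bread"},
    {"sugar"},
    {"bread"},
]
print(support(baskets, {"sugar", "milk"}))
print(confidence(baskets, {"sugar"}, {"milk"}))
```

Here sugar and milk appear together in half of the baskets, and two of the three sugar buyers also bought milk.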
28
Sequence discovery: discovers sequential patterns in data, such as the order in
which items are purchased or data is accessed.
Example:
When a customer purchases a TV set, the sales manager assumes that the customer
will also buy some CDs and a music system afterwards.
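Unlike association rules, sequence discovery cares about order. The sketch below measures, over a set of hypothetical purchase histories, how often one item is bought at some point after another. The function name and data are assumptions for illustration.

```python
def followed_by(sequences, first, later):
    """Fraction of sequences in which `later` occurs after the first `first`."""
    count = 0
    for seq in sequences:
        if first in seq:
            position = seq.index(first)
            # Only purchases made after `first` count
            if later in seq[position + 1:]:
                count += 1
    return count / len(sequences)

# Hypothetical purchase histories, each in time order
histories = [
    ["tv", "cd"],
    ["tv", "music system", "cd"],
    ["cd", "tv"],
    ["tv"],
]
print(followed_by(histories, "tv", "cd"))
```

The third history contains both items but in the opposite order, so it does not support the pattern.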
29
Influence from many disciplines
Data mining is influenced by:
• Artificial Intelligence
• Information Technology
• Database Technology
• Machine Learning
• Pattern Recognition
• Statistics
• Algorithms
• Visualization
• Mathematical Modeling
30
Depending on the data mining approach, techniques from
other disciplines may be applied, such as
• Information retrieval
• Artificial intelligence
• Neural networks
• Fuzzy set theory
• Knowledge representation
• Logic programming
• High performance computing
31
Data Mining issues
• Human interaction: to analyze the output and draw the correct inference
after the data mining step, interfaces are required with both domain and
technical experts.
• Overfitting: occurs when the model fits the current data exactly but does
not fit future data. If the training data set is wrong, overfitting can occur.
• Outliers: the model may be distorted by the presence of outliers.
• Interpretation of results: experts are required because results can be hard
to interpret.
• Visualization of results: visualization helps to display the analyzed data,
but it becomes problematic for multi-dimensional data.
32
Data Mining issues continued…
• Large datasets: scalability problems may arise because algorithms do not
scale well to massive real-world datasets. Sampling and parallelization are
effective tools for this problem.
• High dimensionality: a conventional database may contain many different
attributes, not all of which are relevant. Some may increase complexity and
reduce efficiency. This is known as the dimensionality curse. Data reduction
can be applied to reduce the dimensionality.
• Multimedia data: such data, found for example in GIS databases, can make
conventional data mining algorithms ineffective.
• Missing data: it is not always possible to ignore missing data, but during
preprocessing, data mining algorithms can be used to replace missing data with
estimates.
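One of the simplest ways to replace missing data with an estimate is mean imputation, sketched below. It stands in for the more sophisticated estimation the slide alludes to; representing missing values as `None` is an assumption of this example.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Hypothetical attribute with two missing readings
readings = [10, None, 20, None]
print(impute_mean(readings))
```

Mean imputation keeps every record but flattens variation, so it is only a reasonable choice when few values are missing.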
33
Data Mining issues continued…
• Irrelevant data: data can be reduced by removing irrelevant attributes.
• Noisy data: invalid or incorrect data leads to poor quality data mining.
• Changing data: data warehouses contain non-volatile data. When dynamic data
is loaded, the algorithms must be reapplied and checked for correct results.
• Integration: KDD requests are often treated as one-time needs; data mining
functions are now being integrated into traditional database systems.
• Applications: the challenge lies in using the output of a mining algorithm
effectively rather than in the complexity of the algorithm itself.
34
Data Mining Metrics
How can the effectiveness of the data mining process be measured?
- The KDD process is expensive. The return on investment is the saving
achieved by decisions made using the results.
- It is difficult to measure and quantify.
Social Implications of Data Mining
There are two sides to the coin:
Data mining can be used to improve customer service and satisfaction.
Data mining can also intrude on one's right to privacy.
Omnipresent, invisible data mining affects everyone.
35
Data mining should follow certain guidelines
• Purpose specification and use limitation
• Openness
• Security safeguards
• Individual participation
Privacy-preserving data mining
- Secure multiparty computation
- Data obscuration
36
Applications of Data Mining
Security: identifying terrorists using classification techniques
Weather: predicting weather and pollution
Finance: share market
E-commerce: market basket analysis
Education: student result preparation
Banking: analysis of customers applying for loans
Research: data analysis
Fraud detection
Marketing: targeting customers
Molecular biology
Astronomy
Health: detecting disease in people
37
Books for Reference
Margaret H. Dunham and Sridhar, Data Mining: Introduction and Advanced Topics,
Pearson Education, ISBN 81-7758-785-4
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, ISBN 81-312-0535-5
38
