Unit 3
Motivation:
“Necessity is the Mother of Invention”
Data Explosion Problem
Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
We are drowning in data, but starving for knowledge
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why data mining?
Commercial point of view
Data has become the key competitive advantage of companies
Examples: Facebook, Google, Amazon
Being able to extract useful information out of the data is key for
exploiting them commercially.
Scientific point of view
Scientists are in an unprecedented position: they can collect terabytes of
data
Examples: Sensor data, astronomy data, social network data, gene data
We need the tools to analyze such data to get a better understanding of
the world and advance science
Scale (in data size and feature dimension)
Why not use traditional analytic methods?
Enormity of data, curse of dimensionality
The amount and complexity of the data do not allow for manual
processing. We need automated techniques.
What is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns from data in large databases
Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
Knowledge Discovery in Databases (KDD)
The KDD process consists of the following steps:
Data cleaning: also known as data cleansing, it is a phase in which
noisy data and irrelevant data are removed from the collection.
Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
Data selection: at this step, the data relevant to the analysis is decided
on and retrieved from the data collection.
Data transformation: also known as data consolidation, it is a phase
in which the selected data is transformed into forms appropriate for the
mining procedure.
Data mining: the crucial step in which clever techniques are
applied to extract potentially useful patterns.
Pattern evaluation: in this step, strictly interesting patterns
representing knowledge are identified based on given measures.
Knowledge representation: the final phase in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the
data mining results.
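The steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real KDD system: the records, field names, and the "average spend per age band" pattern are all hypothetical, and cleaning is applied per source before integration for simplicity.

```python
# Hypothetical toy records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "age": 25, "spend": 120.0},
           {"customer": "c2", "age": None, "spend": 80.0}]
store_b = [{"customer": "c3", "age": 41, "spend": 300.0},
           {"customer": "c4", "age": 38, "spend": 20.0}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into decade bands."""
    return [{"age_band": f"{r['age'] // 10 * 10}s", "spend": r["spend"]}
            for r in records]

def mine(records):
    """Data mining: average spend per age band (a deliberately trivial pattern)."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_band"], []).append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}

def evaluate(patterns, min_avg_spend):
    """Pattern evaluation: keep patterns above an interestingness threshold."""
    return {band: avg for band, avg in patterns.items() if avg >= min_avg_spend}

data = transform(select(integrate(clean(store_a), clean(store_b)), ["age", "spend"]))
patterns = evaluate(mine(data), min_avg_spend=100.0)
print(patterns)  # knowledge representation would normally be a chart or report
```

Each function corresponds to one named step, so a step can be swapped for a more realistic implementation without touching the others.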
Data Mining: Classification Schemes
Decisions in data mining
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Classification Criteria in Data Mining
Databases to be mined
Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy,
WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database technology, statistics,
machine learning, visualization, pattern recognition, algorithms, and other
disciplines]
Data Mining Tasks
Prediction Tasks
Use some variables to predict unknown or future values of other
variables.
Description Tasks
Characterize the general properties of the data in the database.
Common data mining tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation/Anomaly Detection [Predictive]
Data mining functionalities
Data characterization: Data characterization is a
summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified
class are typically collected by a database query.
For example, one may wish to characterize the customers of a
store who regularly rent more than 30 movies a year. With a data
cube containing summarization of data, simple OLAP
operations fit the purpose of data characterization.
Discrimination: Data discrimination produces what are called
discriminant rules and is basically the comparison of the general
features of objects between two classes referred to as the target class
and the contrasting class.
For example, one may wish to compare the general characteristics of
the customers who rented more than 30 movies in the last year with
those whose rental count is lower than 5.
The techniques used for data discrimination are similar to the
techniques used for data characterization with the exception
that data discrimination results include comparative measures.
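As a rough sketch of the movie-rental example, the snippet below summarizes a target class (characterization) and then compares it with a contrasting class (discrimination). The customer tuples and the chosen summary features are hypothetical; a real system would more likely run such summaries as OLAP operations over a data cube.

```python
from statistics import mean

# Hypothetical customer records: (rentals_last_year, age, avg_spend_per_visit)
customers = [(42, 23, 12.0), (35, 31, 10.0), (3, 55, 4.0), (1, 61, 6.0), (18, 40, 8.0)]

def summarize(rows):
    """Characterization: summarize the general features of one class."""
    return {"count": len(rows),
            "mean_age": mean(r[1] for r in rows),
            "mean_spend": mean(r[2] for r in rows)}

target = [c for c in customers if c[0] > 30]    # target class: more than 30 rentals
contrast = [c for c in customers if c[0] < 5]   # contrasting class: fewer than 5

target_summary = summarize(target)
contrast_summary = summarize(contrast)

# Discrimination: the same summaries plus comparative measures.
comparison = {k: target_summary[k] - contrast_summary[k]
              for k in ("mean_age", "mean_spend")}
print(target_summary, contrast_summary, comparison)
```

The only difference between the two functionalities here is the final comparison step, which mirrors the remark that discrimination results include comparative measures.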
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur
frequently in data. There are many kinds of frequent patterns,
including:
itemsets,
subsequences,
substructures.
Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
Association analysis: Association analysis studies the
frequency of items occurring together in transactional databases,
and based on a threshold called support, identifies the frequent
item sets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction
when another item appears, is used to pinpoint association rules.
Association analysis is widely used for market basket or
transaction data analysis.
An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or
certainty, of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all of the transactions under analysis
showed that computer and software were purchased together.
This association rule involves a single attribute or predicate (i.e.,
buys) that repeats. Association rules that contain a single
predicate are referred to as single-dimensional association rules.
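To make the two measures concrete, here is a minimal sketch that computes support and confidence for a rule of the form buys(X, "computer") => buys(X, "software"). The five transactions are invented for illustration and are not the AllElectronics data, so the resulting percentages differ from the 1% and 50% quoted above.

```python
# Hypothetical market-basket transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data, two of five transactions contain both items (support 40%), and two of the four transactions with a computer also contain software (confidence 50%).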
A data mining system may find association rules like
age(X, “20...29”) ∧ income(X, “20K...29K”) => buys(X, “CD player”)
[support = 2%, confidence = 60%]
Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys). Because each attribute is referred
to as a dimension, the above rule can be referred to as a multidimensional
association rule.
Typically, association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum confidence
threshold. Additional analysis can be performed to uncover interesting
statistical correlations between associated attribute-value pairs.
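The threshold filtering described above might look like the following sketch. It enumerates only single-item rules by brute force, which is an assumption made to keep the example short; it is not an efficient frequent-itemset algorithm. The transactions and the function name `mine_rules` are hypothetical.

```python
from itertools import permutations

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "software"},
]

def mine_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every
    single-item rule that satisfies both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    rules = []
    for a, b in permutations(items, 2):
        both = count({a, b})
        sup = both / n
        if sup < min_support:
            continue  # discarded: the pair is not frequent enough
        conf = both / count({a})
        if conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

rules = mine_rules(transactions, min_support=0.4, min_confidence=0.7)
for a, b, sup, conf in rules:
    print(f"buys(X, {a!r}) => buys(X, {b!r}) "
          f"[support = {sup:.0%}, confidence = {conf:.0%}]")
```

Note that the pair (computer, printer) passes the support threshold but neither direction of the rule passes the confidence threshold, so both are discarded as uninteresting.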
Classification: Classification is the process of finding a model (or
function) that describes and distinguishes data classes or concepts, for
the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose class label
is known).
A classification model can be represented in various forms, such as:
IF-THEN Rules
A decision tree
Neural network
Support Vector Machine (SVM)
Bayesian Classification
Classification vs. Prediction
Classification predicts categorical class labels (discrete or
nominal). It constructs a model from the training set and the
values (class labels) of a classifying attribute, then uses that
model to classify new data.
Prediction models continuous-valued functions, i.e., predicts
unknown or missing values.
Typical applications
Credit/loan approval
Target Marketing
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization: which category a page belongs to
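To illustrate the prediction side of this distinction, the sketch below fits a straight line by ordinary least squares and uses it to estimate an unknown continuous value. Least squares is just one possible choice of prediction model, and the experience/salary figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # estimate for an unseen value, x = 6
print(predicted)
```

The output is a number on a continuous scale, whereas a classifier applied to the same kind of input would return one label from a fixed set of classes.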
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute. The set of tuples used for
model construction is the training set. The model is represented as
classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test
sample is compared with the model's classification. The accuracy
rate is the percentage of test set samples that are correctly
classified by the model.
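The two steps can be sketched as follows. The model here is a nearest-centroid classifier, chosen only because it fits in a few lines; it is an assumption of this example and not one of the model forms listed above. The loan tuples and labels are hypothetical.

```python
# Hypothetical labelled tuples: ((income_k, debt_k), class_label)
training_set = [((60, 5), "approve"), ((75, 10), "approve"), ((80, 2), "approve"),
                ((20, 15), "reject"), ((25, 30), "reject"), ((30, 25), "reject")]
test_set = [((70, 8), "approve"), ((22, 20), "reject"),
            ((40, 12), "approve"), ((28, 28), "reject")]

def build_model(training):
    """Step 1, model construction: one centroid per predefined class."""
    by_class = {}
    for features, label in training:
        by_class.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in by_class.items()}

def classify(model, features):
    """Step 2, model usage: assign the class whose centroid is nearest."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

model = build_model(training_set)

# Accuracy estimate: compare known test labels with the model's output.
correct = sum(classify(model, x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")
```

One test tuple, (40, 12), lies closer to the "reject" centroid than to the "approve" centroid and is misclassified, so the estimated accuracy is three out of four rather than a perfect score.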
Clustering
Clustering: Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters.
However, unlike classification, in clustering, class labels are unknown
and it is up to the clustering algorithm to discover acceptable classes.
Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels.
There are many clustering approaches, all based on the principle of
maximizing the similarity between objects in the same class (intra-class
similarity) and minimizing the similarity between objects of different
classes (inter-class similarity).
Applications of Cluster Analysis
Understanding:
Group related documents for browsing,
group genes and proteins that have similar functionality,
or group stocks with similar price fluctuations
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
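As an illustration of the first approach in this list, here is a minimal k-means sketch in plain Python. Seeding the centroids with the first k points and running a fixed number of iterations are simplifying assumptions; practical implementations usually pick seeds more carefully and stop on convergence. The sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No class labels are supplied anywhere: the two groups emerge purely from the distances between points, which is what distinguishes clustering from classification.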
Supervised vs. Unsupervised Learning
Supervised learning (classification):
The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering):
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Data mining functionalities
Outlier analysis: Outliers are data elements that cannot be grouped in
a given class or cluster. Also known as exceptions or surprises, they are
often very important to identify. While outliers can be considered noise
and discarded in some applications, they can reveal important
knowledge in other domains, which makes their analysis valuable.
Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same
account.
Outlier values may also be detected with respect to the location and
type of purchase, or the purchase frequency.
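The credit card example can be sketched with a simple z-score test against an account's regular charges. The three-standard-deviation cutoff, the function name `flag_outliers`, and the charge amounts are all assumptions for illustration; production fraud systems use far richer models.

```python
from statistics import mean, stdev

def flag_outliers(history, new_charges, threshold=3.0):
    """Flag charges more than `threshold` standard deviations above
    the mean of the account's regular charges."""
    mu, sigma = mean(history), stdev(history)
    return [c for c in new_charges if (c - mu) / sigma > threshold]

# Hypothetical regular charges for one account, then a batch of new purchases.
regular = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 45.0]
incoming = [49.0, 58.0, 1250.0]

suspicious = flag_outliers(regular, incoming)
print(suspicious)
```

Only the extremely large purchase is flagged; the other two fall well within the account's usual range.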
Data Mining Architecture
Data mining is a very important process where potentially useful
and previously unknown information is extracted from large
volumes of data.
There are a number of components involved in the data mining
process. These components constitute the architecture of a data
mining system.
The major components of any data mining system are data
source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface and
knowledge base
Data Sources
Database, data warehouse, World Wide Web (WWW), text files and
other documents are the actual sources of data. You need large
volumes of historical data for data mining to be successful.
Different Processes
The data needs to be cleaned, integrated and selected before
passing it to the database or data warehouse server.
Database or Data Warehouse Server
The database or data warehouse server contains the actual data that
is ready to be processed. Hence, the server is responsible for
retrieving the relevant data based on the data mining request of the
user.
Data Mining Engine
It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering,
prediction, time-series analysis etc.
Pattern Evaluation Module
The pattern evaluation module is mainly responsible for measuring the
interestingness of a pattern against a threshold value. It interacts
with the data mining engine to focus the search towards interesting
patterns.
Graphical User Interface
The graphical user interface module communicates between the user
and the data mining system. This module helps the user use the system
easily and efficiently without knowing the real complexity behind the
process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily
understandable manner.
Knowledge Base
The knowledge base is helpful in the whole data mining process. It
might be useful for guiding the search or evaluating the
interestingness of the result patterns. The knowledge base might
even contain user beliefs and data from user experiences that can be
useful in the process of data mining. The data mining engine might
get inputs from the knowledge base to make the result more
accurate and reliable.
Data Mining Issues
Major issues in data mining fall into five
groups:
Mining methodology
User interaction
Efficiency and scalability
Diversity of data types
Data mining and society
Mining methodology
Mining different kinds of knowledge in databases
Mining of knowledge at multiple levels of abstraction
Handling noisy or incomplete data
Pattern evaluation
User interaction
Interactive mining
Incorporation of background knowledge
Query languages and ad hoc mining
Presentation and visualization of data mining results
Efficiency and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Diversity of data types
Handling of relational and complex types of data
Mining information from heterogeneous databases and global
information systems
Data mining and society
Social impact of data mining
Privacy-preserving data mining
Invisible data mining
Data Mining Applications
Areas where data mining is widely used include:
Healthcare and Insurance
Measuring Treatment Effectiveness – This application of data mining
involves comparing and contrasting symptoms, causes and courses of
treatment to find the most effective course of action for a certain illness or
condition. For example, patient groups who are treated with different drug
regimens can be compared to determine which treatment plans work best
and save the most money.
Detecting Fraud and Abuse – This involves establishing normal patterns,
then identifying unusual patterns of medical claims by clinics, physicians,
labs, or others. This application can also be used to identify inappropriate
referrals or prescriptions and insurance fraud and fraudulent medical
claims. The Texas Medicaid Fraud and Abuse Detection System is a good
example of a business using data mining to detect fraud.
Education
This area is concerned with developing methods that discover knowledge from
data originating from educational environments.
Its goals include predicting students’ future learning behavior and
studying the effects of educational support. An institution can use data
mining to make better-informed decisions and to predict student results.
With these results, the institution can focus on what to teach and
how to teach.
Retail Industry
Data mining in the retail industry helps identify customer buying patterns
and trends that lead to improved quality of customer service and good
customer retention and satisfaction.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Market basket analysis
Banking/Finance
Financial data in the banking and financial industry is generally reliable and
of high quality, which facilitates systematic data analysis and data mining.
Some typical cases are as follows:
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus
to anomaly detection. It helps an analyst distinguish unusual activity from
common everyday network activity.
Monitoring and analyzing traffic
Identifying abnormal activity
Other applications are:
Bioinformatics
Crime agencies
Scientific Applications