Introduction (2 Hrs)
Pukar Karki
Assistant Professor
[email protected]
Contents
1. Data Mining Origin
2. Data Mining & Data Warehousing basics
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often motivate experiments
and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical,
theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to find closed-form
solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific information management, acquisition, organization, query, and visualization tasks scale almost
linearly with data volumes. Data mining is a major new challenge!
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
Watch out: Is everything “data mining”? No. The following are not data mining:
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
1. Data cleaning (to remove noise and inconsistent data)
[Figure: data mining layers, top to bottom as extracted: Data Exploration (Statistical Summary, Querying, and Reporting); Data Preprocessing/Integration, Data Warehouses (DBA); Data Sources (Paper, Files, Web documents, Scientific experiments, Database Systems)]
KDD Process: A Typical View from ML and Statistics
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your Walmart?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other
applications?
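The support and confidence figures attached to a rule such as Diaper → Beer can be computed directly from transaction data. The following is a minimal sketch, not an efficient mining algorithm: the transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions: List[FrozenSet[str]] = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
# prints: diaper -> beer [support=60%, confidence=75%]
print(f"diaper -> beer [support={rule_support:.0%}, confidence={rule_confidence:.0%}]")
```

Here 3 of 5 baskets contain both items (support 60%), and 3 of the 4 baskets with diapers also contain beer (confidence 75%). A high confidence alone does not show correlation or causality, which is why the slide asks about both.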
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks,
rule-based classification, pattern-based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
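The idea of constructing a model from training examples and then predicting unknown labels can be sketched with a one-level decision tree (a "decision stump"). This is only an illustration under assumed data: the mileage values, the labels "low" and "high", and the function names `fit_stump` and `predict` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical training examples: (gas mileage in miles per gallon, class label).
training: List[Tuple[float, str]] = [
    (12.0, "low"), (15.0, "low"), (18.0, "low"),
    (27.0, "high"), (31.0, "high"), (35.0, "high"),
]


def predict(threshold: float, mpg: float) -> str:
    """Apply the learned model: one test on one attribute."""
    return "high" if mpg > threshold else "low"


def fit_stump(examples: List[Tuple[float, str]]) -> float:
    """Learn the threshold that makes the fewest errors on the training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds are midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold: float) -> int:
        return sum(1 for x, label in examples if predict(threshold, x) != label)

    return min(candidates, key=errors)


threshold = fit_stump(training)   # model construction from labeled examples
print(threshold, predict(threshold, 30.0), predict(threshold, 14.0))
```

The learned threshold separates the two classes of the training data, and `predict` then assigns a label to cars that were not in the training set. Real decision trees repeat this split recursively over many attributes.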
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses
to find distribution patterns
Principle: Maximize intra-class similarity and minimize inter-class
similarity
Many methods and applications
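One of the many clustering methods is k-means, which the slides do not name; it is used here only as an assumed example of grouping unlabeled data. The house prices, the starting centers, and the function name `kmeans_1d` are hypothetical.

```python
from typing import List, Tuple


def kmeans_1d(points: List[float], centers: List[float],
              iterations: int = 10) -> Tuple[List[float], List[List[float]]]:
    """Plain k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    clusters: List[List[float]] = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # converged: assignments no longer change
            break
        centers = new_centers
    return centers, clusters


# Hypothetical house prices (in thousands); no class labels are given.
house_prices = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]
centers, clusters = kmeans_1d(house_prices, centers=[100.0, 120.0])
print(centers, clusters)
```

No labels are supplied; the two groups emerge from the data alone. Points within a cluster are close to each other (high intra-class similarity) and far from the other cluster (low inter-class similarity).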
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general behavior of
the data
Noise or exception? ― One person’s garbage could be another person’s
treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
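The slide lists clustering and regression as sources of outliers. The sketch below uses a simpler stand-in technique, the z-score, to show what "does not comply with the general behavior" can mean in practice. The transaction amounts, the threshold of 2.0, and the function name `zscore_outliers` are assumptions made for illustration.

```python
import statistics
from typing import List


def zscore_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    if stdev == 0:
        return []                        # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 500.0]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or a fraud case to investigate depends on the application, which is the point of the "garbage or treasure" remark.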
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification, clustering to PageRank & HITS
algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data
analysis), biological sequence analysis, biological network analysis
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis
Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization’s
operational database
Supports information processing by providing a solid platform of consolidated,
historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”—W. H.
Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than
that of operational systems
Operational database: current value data
Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time
element”
Data Warehouse—Nonvolatile
A physically separate store of data transformed from the operational
environment
Operational update of data does not occur in the data warehouse
environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
DBMS vs. Data Warehouse
The major task of online operational database systems is to perform
online transaction and query processing.
These systems are called online transaction processing (OLTP)
systems.
They cover most of the day-to-day operations of an organization such
as purchasing, inventory, manufacturing, banking, payroll, registration,
and accounting.
Data warehouse systems, on the other hand, serve users or
knowledge workers in the role of data analysis and decision making.
Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users.
These systems are known as online analytical processing (OLAP)
systems.
OLTP vs. OLAP
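The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical, and a single in-memory database stands in for both systems purely for illustration; in practice the warehouse is kept separate from the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: short transactions that insert or update individual rows.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
        [("east", 2022, 100.0), ("east", 2023, 150.0),
         ("west", 2022, 200.0), ("west", 2023, 250.0)],
    )
with conn:
    conn.execute("UPDATE sales SET amount = 175.0 WHERE id = 2")

# OLAP-style work: a read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The OLTP statements touch one record at a time and must be transactional; the OLAP query reads across the whole table to summarize it for analysis.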
Data Warehousing: A Multitiered Architecture
✔ The bottom tier is a warehouse database server that is almost always a relational database system.
✔ Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
✔ The middle tier is an OLAP server that is typically implemented using either
- a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or
- a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
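The difference between the two OLAP models can be sketched by computing the same roll-up (total sales per region) in both styles. This is a simplified illustration, not how a production OLAP server is built: the fact rows, the `sales_fact` table, and the dictionary-based "cube" are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical facts: (region, quarter, amount).
facts = [
    ("east", "Q1", 10.0), ("east", "Q2", 20.0),
    ("west", "Q1", 30.0), ("west", "Q2", 40.0),
]

# ROLAP-style: facts live in a relational table, and the multidimensional
# roll-up over the quarter dimension is mapped to a standard GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)
rolap_rollup = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region").fetchall()
)

# MOLAP-style: facts live in a structure addressed by dimension coordinates
# (a dict stands in for a multidimensional array), and the roll-up is
# computed directly on that structure.
cube = {(region, quarter): amount for region, quarter, amount in facts}
molap_rollup = defaultdict(float)
for (region, _quarter), amount in cube.items():
    molap_rollup[region] += amount

print(rolap_rollup, dict(molap_rollup))
```

Both styles answer the same analytical question; they differ in where the data is stored and which engine performs the aggregation.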
✔ The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definition, data mart locations and
contents
Operational metadata
data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports, audit
trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning the entire
organization
Data Mart
a subset of corporate-wide data that is of value to a specific group
of users. Its scope is confined to specific, selected groups, such as
a marketing data mart
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
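A virtual warehouse can be sketched as a SQL view over an operational table. The example below uses `sqlite3`; the `orders` table, the view name `customer_totals`, and the snapshot table `customer_totals_mat` are hypothetical. SQLite has no materialized views, so materialization is emulated here by copying the view's result into a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)],
)

# Virtual warehouse: a summary view, recomputed from the operational table
# every time it is queried.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

# Emulated materialization: store the view's current result as a table.
conn.execute("CREATE TABLE customer_totals_mat AS SELECT * FROM customer_totals")

# A later operational insert is visible through the view but not the snapshot.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 5.0)")

live = dict(conn.execute("SELECT customer, total FROM customer_totals").fetchall())
snapshot = dict(conn.execute("SELECT customer, total FROM customer_totals_mat").fetchall())
print(live, snapshot)
```

The view is always current but adds query load to the operational database; the materialized copy is fast to read but becomes stale until it is refreshed.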
Data Mart
The implementation cycle of a data mart is more likely to be
Depending on the source of data, data marts can be categorized
as independent or dependent.
– Independent data marts are sourced from data captured from
one or more operational systems or external information providers,
or from data generated locally within a particular department or
geographic area.
– Dependent data marts are sourced directly from enterprise data
warehouses.
Extraction, Transformation, and Loading (ETL)
Data extraction
get data from multiple, heterogeneous, and external sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse format
Load
sort, summarize, consolidate, compute views, check integrity, and build
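The extraction, cleaning, transformation, and load steps above can be sketched as a small pipeline. Everything in it is assumed for illustration: the two sources (`CSV_EXPORT`, `LEGACY_ROWS`), their field names, and the target `sales` table. For brevity the cleaning step drops bad records instead of rectifying them.

```python
import csv
import io
import sqlite3

# Two hypothetical, heterogeneous sources.
CSV_EXPORT = "customer,amount\nAlice,10.50\nbob,not-a-number\nCarol,7\n"
LEGACY_ROWS = [{"CUST": "dave", "AMT_CENTS": 1250}]


def extract():
    """Extract: pull raw records from each source without changing them."""
    records = [("csv", row) for row in csv.DictReader(io.StringIO(CSV_EXPORT))]
    records += [("legacy", row) for row in LEGACY_ROWS]
    return records


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean(records):
    """Clean: detect CSV records with unparseable amounts and drop them."""
    return [(src, row) for src, row in records
            if src != "csv" or is_number(row["amount"])]


def transform(records):
    """Transform: convert both source formats to (customer, amount) rows."""
    rows = []
    for src, row in records:
        if src == "csv":
            rows.append((row["customer"].strip().lower(), float(row["amount"])))
        else:  # legacy format stores the amount in cents
            rows.append((row["CUST"].strip().lower(), row["AMT_CENTS"] / 100))
    return rows


def load(rows, conn):
    """Load: sort the rows and insert them into the warehouse table."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", sorted(rows))
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(clean(extract())), warehouse)
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping each step as a separate function mirrors the list above and makes it easy to test or replace one stage without touching the others.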
Need for Data Warehousing
Ensure consistency
- Data warehouses are programmed to apply a uniform format to all collected data,
which makes it easier for corporate decision-makers to analyze and share data
insights with their colleagues around the globe.
- Standardizing data from different sources also reduces the risk of error in
interpretation and improves overall accuracy.
Make better business decisions
- Successful business leaders develop data-driven strategies and rarely make
decisions without consulting the facts.
- Data warehousing improves the speed and efficiency of accessing different
data sets and makes it easier for corporate decision-makers to derive insights
that will guide the business and marketing strategies that set them apart from
their competitors.
Improve their bottom line
- Data warehouse platforms allow business leaders to quickly access their
organization's historical activities and evaluate initiatives that have been
successful or unsuccessful in the past.
- This allows executives to see where they can adjust their strategy to decrease
costs, maximize efficiency and increase sales to improve their bottom line.
Review Questions
1) Explain how a data mining system can be integrated with a database/data
warehouse system. Explain the data mining process with a diagram.
2) Explain data warehouse architecture.
3) How is a data warehouse different from an RDBMS? Also list the similarities.
4) What are a data warehouse and a data mart?
5) Differentiate between OLAP and OLTP.
6) “The world is data rich and information poor.” Justify in your own words.