Introduction

Data Mining IOE - Chapter 1 Notes
1. Introduction (2 Hrs)

Pukar Karki
Assistant Professor
Contents
1. Data Mining Origin
2. Data Mining & Data Warehousing basics
Why Data Mining?

Necessity, who is the mother of invention. – Plato

 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”

Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often motivate experiments
and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical,
theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to find closed-form
solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks scale almost
linearly with data volumes. Data mining is a major new challenge
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
 Watch out: Is everything “data mining”? No. The following are not:
 Simple search and query processing
 (Deductive) expert systems
Knowledge Discovery (KDD) Process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
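The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two source lists, the field names, and the "interestingness" threshold are all made up, and the "mining" step is a trivial ranking rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources (all names and values are made up).
store_a = [{"customer": "c1", "item": "bread", "qty": 2},
           {"customer": "c2", "item": "milk", "qty": None}]  # noisy: missing qty
store_b = [{"customer": "c1", "item": "milk", "qty": 1},
           {"customer": "c3", "item": "bread", "qty": 3}]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate the quantity sold per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Step 5: a trivial stand-in 'pattern': items ranked by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Step 7: render the surviving patterns for a user."""
    return [f"{item}: {qty} units" for item, qty in patterns]

records = select(integrate(clean(store_a), clean(store_b)), ["item", "qty"])
report = present(evaluate(mine(transform(records)), min_qty=2))
print(report)
```

In practice each step is far more involved, and the process is usually iterative rather than a single pass.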
Data Mining in Business Intelligence
Layers from top to bottom, with the role typically responsible (the potential to support business decisions increases toward the top):
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
Data Mining Function: (1) Generalization
 Information integration and data warehouse construction
 Data cleaning, transformation, integration, and multidimensional data
model
 Data cube technology
 Scalable methods for computing (i.e., materializing) multidimensional
aggregates
 OLAP (online analytical processing)
 Multidimensional concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet region
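To make "materializing multidimensional aggregates" concrete, the following naive sketch computes the total of a measure for every subset of dimensions, which is the idea behind a data cube. The rows, dimension names, and the `materialize_cube` helper are hypothetical, and this brute-force loop is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; the values are made up for illustration.
rows = [
    {"region": "dry", "season": "summer", "sales": 10},
    {"region": "dry", "season": "winter", "sales": 4},
    {"region": "wet", "season": "summer", "sales": 7},
    {"region": "wet", "season": "winter", "sales": 9},
]
dimensions = ("region", "season")

def materialize_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube

cube = materialize_cube(rows, dimensions, "sales")
print(cube[()])           # grand total
print(cube[("region",)])  # contrast dry vs. wet regions
```

With n dimensions there are 2^n such groupings, which is why efficient cube computation is a research topic in its own right.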

Data Mining Function: (2) Association and Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your Walmart?
 Association, correlation vs. causality
 A typical association rule
 Diaper → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large datasets?
 How to use such patterns for classification, clustering, and other
applications?
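A small worked example may help with the two numbers attached to an association rule. Support is the fraction of transactions containing both sides of the rule; confidence is the fraction of transactions containing the left side that also contain the right side. The transactions below are made up. Lift is not mentioned on the slide; it is added here only as one common way to check whether an association is also a positive correlation (lift above 1).

```python
# Hypothetical market-basket transactions (made up for illustration).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    n = len(transactions)
    both = count(antecedent | consequent, transactions)
    left = count(antecedent, transactions)
    right = count(consequent, transactions)
    support = both / n
    confidence = both / left
    lift = (both * n) / (left * right)
    return support, confidence, lift

support, confidence, lift = rule_measures({"diaper"}, {"beer"}, transactions)
print(f"Diaper -> Beer [{support:.0%}, {confidence:.0%}], lift = {lift:.2f}")
```

Here 3 of 5 transactions contain both items (support 60%), and 3 of the 4 diaper transactions also contain beer (confidence 75%).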

Data Mining Function: (3) Classification
 Classification and label prediction
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 E.g., classify countries based on (climate), or classify cars based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-
based classification, pattern-based classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
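As a minimal illustration of constructing a model from training examples, the sketch below fits a one-level decision tree (a "decision stump") that classifies cars by gas mileage. The training data, the labels, and the helper names are assumptions made for this example; real decision tree learners use impurity measures and many attributes.

```python
from collections import Counter

# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(12, "low"), (15, "low"), (18, "low"),
            (28, "high"), (32, "high"), (40, "high")]

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(examples):
    """Choose the threshold with the fewest training errors."""
    values = sorted({x for x, _ in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in examples if x <= threshold]
        right = [y for x, y in examples if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

def predict(model, x):
    """Predict the class label of an unseen example."""
    threshold, left_label, right_label = model
    return left_label if x <= threshold else right_label

model = fit_stump(training)
print(model, predict(model, 25), predict(model, 14))
```

The learned threshold separates the two classes on the training data, and `predict` then assigns labels to cars whose class is unknown.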

Data Mining Function: (4) Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g., cluster houses
to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing interclass
similarity
 Many methods and applications
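One widely used clustering method is k-means, which the slide does not name; it is used here only to illustrate grouping unlabeled data. The sketch clusters made-up house prices in one dimension, and the initial centers are chosen by hand, which is an assumption of this example.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 300])
print(centers, clusters)
```

Points within each resulting cluster are close to one another (high intra-class similarity), while the two clusters are far apart (low inter-class similarity).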

Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general behavior of
the data
 Noise or exception? ― One person’s garbage could be another person’s
treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
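The slide mentions outliers as a by-product of clustering or regression. As a simpler stand-in, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). The transaction amounts and the threshold of 2.0 are assumptions chosen for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` std devs."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is clearly unusual.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.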

Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Applications of Data Mining

Where there are data, there are data mining applications

 Web page analysis: from web page classification, clustering to PageRank & HITS
algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis (microarray data
analysis), biological sequence analysis, biological network analysis
 From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis
Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from the organization’s
operational database
 Support information processing by providing a solid platform of consolidated,
historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.”—W. H.
Inmon
 Data warehousing:
 The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented

 Organized around major subjects, such as customer, product, sales


 Focusing on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing
 Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process

Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous data sources
 relational databases, flat files, on-line transaction records

 Data cleaning and data integration techniques are applied.


 Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant
 The time horizon for the data warehouse is significantly longer than
that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain “time
element”

Data Warehouse—Nonvolatile
 A physically separate store of data transformed from the operational
environment
 Operational update of data does not occur in the data warehouse
environment
 Does not require transaction processing, recovery, and
concurrency control mechanisms
 Requires only two operations in data accessing:
 initial loading of data and access of data

DBMS vs. Data Warehouse
 The major task of online operational database systems is to perform
online transaction and query processing.
 These systems are called online transaction processing (OLTP)
systems.
 They cover most of the day-to-day operations of an organization such
as purchasing, inventory, manufacturing, banking, payroll, registration,
and accounting.
 Data warehouse systems, on the other hand, serve users or
knowledge workers in the role of data analysis and decision making.
 Such systems can organize and present data in various formats in
order to accommodate the diverse needs of different users.
 These systems are known as online analytical processing (OLAP)
systems.
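The difference in workload can be illustrated with an in-memory SQLite database from the Python standard library. This is only a sketch: the `sales` table, its columns, and the values are hypothetical, and a real warehouse would run on a separate system rather than on the operational database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: short transactions that insert or update a few rows.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
        [("east", 2023, 120.0), ("east", 2023, 80.0),
         ("east", 2024, 50.0), ("west", 2024, 200.0)],
    )

# OLAP-style work: a read-only query that scans and summarizes many rows.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
print(summary)
conn.close()
```

The inserts represent day-to-day operations; the grouped summary is the kind of consolidated, historical view an analyst would request.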
OLTP vs. OLAP

Data Warehousing: A Multitiered Architecture


The bottom tier is a warehouse
database server that is almost always a
relational database system.

Back end tools and utilities are used to
feed data into the bottom tier from
operational databases or other external
sources.

The middle tier is an OLAP server that is
typically implemented using either
- a relational OLAP (ROLAP) model (i.e., an
extended relational DBMS that maps
operations on multidimensional data to
standard relational operations); or
- a multidimensional OLAP (MOLAP) model
(i.e., a special-purpose server that directly
implements multidimensional data and
operations).

The top tier is a front-end client layer,
which contains query and reporting
tools, analysis tools, and/or data mining
tools (e.g., trend analysis, prediction,
and so on)
Metadata Repository
 Metadata are the data that define warehouse objects. The repository stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data definition, data mart locations and
contents
 Operational meta-data
 data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports, audit
trails)
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Data related to system performance
 warehouse schema, view and derived data definitions
 Business data
 business terms and definitions, ownership of data, charging policies
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects spanning the entire
organization
 Data Mart
 a subset of corporate-wide data that is of value to a specific group
of users. Its scope is confined to specific, selected groups, such as
marketing data mart
 Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse
 A set of views over operational databases
 Only some of the possible summary views may be materialized
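A virtual warehouse can be pictured as views defined directly over operational tables. The sketch below uses SQLite from the Python standard library; the `orders` table and the `product_revenue` view are hypothetical names. Note that a plain SQLite view is not materialized: it is recomputed each time it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational table (hypothetical).
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        product TEXT,
        qty INTEGER,
        unit_price REAL
    );
    INSERT INTO orders (product, qty, unit_price) VALUES
        ('pen', 10, 1.5), ('pen', 4, 1.5), ('book', 2, 12.0);

    -- Summary view over the operational data.
    CREATE VIEW product_revenue AS
        SELECT product, SUM(qty * unit_price) AS revenue
        FROM orders
        GROUP BY product;
""")

rows = conn.execute(
    "SELECT product, revenue FROM product_revenue ORDER BY product"
).fetchall()
print(rows)
conn.close()
```

Because every query on the view runs against the operational table, this approach is easy to build but places extra load on the operational system.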
Data Mart
 The implementation cycle of a data mart is more likely to be
measured in weeks rather than months or years.
 However, it may involve complex integration in the long run if its
design and planning were not enterprise-wide.

 Depending on the source of data, data marts can be categorized
as independent or dependent.
– Independent data marts are sourced from data captured from
one or more operational systems or external information providers,
or from data generated locally within a particular department or
geographic area.
– Dependent data marts are sourced directly from enterprise data
warehouses.

Extraction, Transformation, and Loading (ETL)
 Data extraction
 get data from multiple, heterogeneous, and external sources

 Data cleaning
 detect errors in the data and rectify them when possible

 Data transformation
 convert data from legacy or host format to warehouse format

 Load
 sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions

 Refresh
 propagate the updates from the data sources to the warehouse
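The ETL stages listed above can be sketched end to end with the Python standard library. Everything specific here is an assumption for illustration: the hotel records, the exchange rates, the `hotel_price` table, and the choice of US dollars as the warehouse format.

```python
import sqlite3

RATE_TO_USD = {"EUR": 1.25, "USD": 1.0}  # made-up exchange rates

def extract():
    """Extraction: gather records from several (here, simulated) sources."""
    eu_source = [{"hotel": "Alpha", "price": "100", "currency": "EUR"},
                 {"hotel": "Beta", "price": "", "currency": "EUR"}]
    us_source = [{"hotel": "Gamma", "price": "90", "currency": "USD"}]
    return eu_source + us_source

def clean(records):
    """Cleaning: drop records with a missing price."""
    return [r for r in records if r["price"].strip()]

def transform(records):
    """Transformation: convert every price to the warehouse currency."""
    return [(r["hotel"], float(r["price"]) * RATE_TO_USD[r["currency"]])
            for r in records]

def load(conn, rows):
    """Load/refresh: insert new rows and overwrite changed ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hotel_price (hotel TEXT PRIMARY KEY, price_usd REAL)"
    )
    with conn:
        conn.executemany("INSERT OR REPLACE INTO hotel_price VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute(
    "SELECT hotel, price_usd FROM hotel_price ORDER BY hotel"
).fetchall()
print(loaded)
```

Running `load` again with updated source data acts as a simple refresh, since `INSERT OR REPLACE` overwrites rows that share the same hotel key.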

Need for Data Warehousing
 Ensure consistency
- Data warehouses are programmed to apply uniform format to all collected data,
which makes it easier for corporate decision-makers to analyze and share data
insights with their colleagues around the globe.
- Standardizing data from different sources also reduces the risk of error in
interpretation and improves overall accuracy.
 Make better business decisions
- Successful business leaders develop data-driven strategies and rarely make
decisions without consulting the facts.
- Data warehousing improves the speed and efficiency of accessing different
data sets and makes it easier for corporate decision-makers to derive insights
that will guide the business and marketing strategies that set them apart from
their competitors.
 Improve the bottom line
- Data warehouse platforms allow business leaders to quickly access their
organization's historical activities and evaluate initiatives that have been
successful or unsuccessful in the past.
- This allows executives to see where they can adjust their strategy to decrease
costs, maximize efficiency and increase sales to improve their bottom line.
Review Questions
1) Explain how a data mining system can be integrated with a database/data
warehouse system. Explain the data mining process with a diagram.
2) Explain data warehouse architecture.
3) How is a data warehouse different from an RDBMS? Also list the similarities.
4) What are a data warehouse and a data mart?
5) Differentiate between OLAP and OLTP.
6) “The world is data rich and information poor.” Justify in your own words.
