Data Mining From Scratch With Excel And R
Presenter:
Marcelinus A.S. Adhiwibawa, S.P., M.Stat.
Alumnus, Master of Statistics Program, Universitas Brawijaya
Research Assistant, MRCPP, Universitas Ma Chung
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras,
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What Is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amounts of data
• Alternative name
• Knowledge discovery in databases (KDD)
Why Data Mining?—Potential Applications
• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
• Risk analysis and management
• Forecasting, customer retention, quality control, competitive
analysis
• Fraud detection and detection of unusual patterns (outliers)
Why Data Mining?—Potential Applications
• Other Applications
• Text mining (newsgroups, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
Market Analysis and Management
• Where does the data come from?
• Credit card transactions, discount coupons, customer
complaint calls
• Target marketing
• Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
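Finding clusters of customers who share characteristics can be sketched with a small k-means routine. This is a minimal illustration, not a method prescribed by these slides: the customer records, the two features (income and spending), and the starting centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        centroids = new_centroids
    # Final labels: index of the nearest centroid for each point.
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Hypothetical customers: (income level, spending level) in arbitrary units.
customers = [(2.0, 1.0), (2.5, 1.2), (3.0, 0.9),
             (9.0, 6.0), (9.5, 6.5), (10.0, 5.8)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)     # cluster index per customer
print(centroids)  # one "model" customer profile per cluster
```

Each centroid can be read as the profile of a "model" customer for its segment. In practice the features would need scaling and the number of clusters would need to be chosen.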
Market Analysis and Management
• Cross-market analysis
• Associations and correlations between product sales, and
prediction based on such associations
• Customer profiling
• What types of customers buy what products
• Customer requirement analysis
• Identifying the best products for different customers
• Predict what factors will attract new customers
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering and model construction for fraud, outlier analysis
• Applications: Health care, retail, credit card services, telecommunications
• Medical insurance
• Professional patients and rings of doctors
• Unnecessary or correlated screening tests
• Telecommunications:
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
• Retail industry
• Analysts estimate that 38% of retail shrinkage (inventory loss) is due to
dishonest employees
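The idea of flagging patterns that deviate from an expected norm can be sketched with a simple z-score rule on call durations. This is only one of many outlier-analysis techniques. The data and the threshold of 2 standard deviations are assumptions made for illustration.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold (assumed: 2.0)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates from the norm
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical call durations in minutes for one subscriber.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]
suspicious = flag_outliers(call_minutes)
print(suspicious)  # [60]
```

A real phone-call model would also use destination and time of day or week, as listed above. A single extreme value also inflates the mean and standard deviation, so robust statistics such as the median are often preferred.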
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain
• Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
• Find useful features, dimensionality/variable reduction.
• Choosing functions of data mining
• Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
• Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
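The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, the field layout, the function names, and the "interesting" threshold are all hypothetical, and the mining step is a simple summarization (average amount per region).

```python
from collections import defaultdict

# Hypothetical raw records: (customer_id, region, amount); None = missing value.
raw = [
    (1, "east", 120.0),
    (2, "east", None),
    (3, "west", 80.0),
    (4, "west", 100.0),
    (5, "north", 40.0),
    (3, "west", 80.0),  # exact duplicate of an earlier record
]

def select(records, regions):
    """Data selection: keep only the task-relevant regions."""
    return [r for r in records if r[1] in regions]

def clean(records):
    """Data cleaning: drop records with missing amounts and exact duplicates."""
    seen, cleaned = set(), []
    for r in records:
        if r[2] is None or r in seen:
            continue
        seen.add(r)
        cleaned.append(r)
    return cleaned

def transform(records):
    """Data reduction: keep only the features needed for mining."""
    return [(region, amount) for _, region, amount in records]

def mine(pairs):
    """Data mining (summarization): average amount per region."""
    grouped = defaultdict(list)
    for region, amount in pairs:
        grouped[region].append(amount)
    return {region: sum(a) / len(a) for region, a in grouped.items()}

def evaluate(patterns, min_average):
    """Pattern evaluation: keep only patterns above an assumed threshold."""
    return {k: v for k, v in patterns.items() if v >= min_average}

target = select(raw, {"east", "west"})
patterns = mine(transform(clean(target)))
interesting = evaluate(patterns, min_average=100.0)
print(patterns)     # {'east': 120.0, 'west': 90.0}
print(interesting)  # {'east': 120.0}
```

Most of the functions deal with selecting and cleaning data rather than mining it, which is consistent with the note that preprocessing takes a large share of the effort.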
Data Mining: On What Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
• Spatial and temporal data
• Time-series data
• Stream data
• Multimedia database
• Text databases & WWW
Data Mining Functionalities
Data mining functionalities are generally divided into two major
categories:
• Predictive tasks [Use some attributes to predict unknown or future
values of other attributes.]
• Classification
• Regression
• Descriptive tasks [Find human-interpretable patterns that describe the
data.]
• Association Discovery
• Clustering
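The two predictive tasks can be illustrated with a minimal sketch: a 1-nearest-neighbor classifier (predicting a category) and a least-squares line (predicting a number). Both techniques and all data below are assumptions chosen for brevity. The slides do not prescribe a specific algorithm.

```python
def nearest_neighbor_classify(train, query):
    """Classification: return the label of the training row closest to query.

    train is a list of (feature_value, label) pairs.
    """
    _, label = min(train, key=lambda row: abs(row[0] - query))
    return label

def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled data: one numeric attribute and a known class.
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
predicted_label = nearest_neighbor_classify(train, 7.0)
print(predicted_label)  # high

# Hypothetical numeric data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept
print(predicted_value)  # 11.0
```

In both cases known attribute values are used to predict an unknown one. Descriptive tasks such as clustering and association discovery instead summarize the data without a target attribute.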
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them are
interesting
• Interestingness measures
• A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
• Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.
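The two objective measures named above, support and confidence, can be computed directly from a list of transactions. This is a minimal sketch with hypothetical market baskets, and the function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})       # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
print(s)  # 0.5
print(round(c, 3))  # 0.667
```

A rule such as bread => milk is usually reported only if both values exceed user-chosen minimum thresholds, which is one way to filter thousands of candidate patterns down to the interesting ones.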
Origins of Data Mining
• Draws ideas from: machine learning/AI, statistics, and database
systems
[Figure: Data Mining at the overlap of Statistics, Machine Learning, and Database systems]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Major Issues in Data Mining
• Mining methodology
• Mining different kinds of knowledge from diverse data types,
e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing knowledge:
knowledge fusion