Cosc6211 Advanced Concepts in Data Mining: Weekend
General Information
Grading: tentative
Data sources
• Data sources from internet
· UCI KDD Archive
· UCI Machine Learning Repository
Where to Find the Set of Slides for the Textbook
– http://www.cs.sfu.ca/~han/dmbook
• Other conference presentation slides (.ppt):
– http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/~han
Teaching materials
Textbooks
• Jiawei Han and Micheline Kamber, “Data Mining:
Concepts and Techniques”.
References
• Pang-Ning Tan, Michael Steinbach, and Vipin
Kumar, "Introduction to Data Mining", Pearson
Addison Wesley, 2008, ISBN: 0-32-134136-7
• Margaret H. Dunham, Data Mining: Introductory
and Advanced Topics, Prentice Hall, 2003.
Course outline
1. Introduction, Data Preprocessing, and
Association Rule Mining
2. Classification and Prediction
– Decision trees, Bayesian classifiers, rule-based classifiers,
ensemble methods, SVM, k-nearest neighbor, neural
networks, and other classifiers
3. Clustering
4. Complex data mining and text mining
Outline
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining tasks
• Are all the patterns interesting?
• Major issues in data mining
• Association Rule Mining (if time allows)
1. Introduction, Data Preprocessing,
and Association Rule Mining
Why Data Mining?
• The explosive growth of data: from terabytes
to petabytes (1 petabyte = 1 million gigabytes)
– Data collection and data availability
• Automated data collection tools, database systems,
Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific
simulation, …
• Society and everyone: news, digital cameras, YouTube
Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data
• Google has petabytes of web data
• Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Data Mining? Scientific Viewpoint
• Data collected and stored at enormous speeds
– remote sensors on a satellite
• NASA EOSDIS archives petabytes of earth science data per year
– telescopes scanning the skies
• Sky survey data
– scientific simulations
• terabytes of data generated in a few hours
• Data mining helps scientists
– in automated analysis of massive datasets
– in hypothesis formation
Why Data Mining?
What Is Data Mining?
What is (not) Data Mining?
• Is not data mining:
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
• Is data mining:
– Group together similar documents returned by search engine
according to their context (e.g. Amazon rainforest,
Amazon.com)
• Identify the following:
– Sales analysis
• What are the sales by quarter and region?
• How do sales compare in two different stores in the same state?
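The contrast between "is not" and "is" data mining above can be sketched in a few lines. This is a minimal illustration, not a production method: the directory entries, the search snippets, the helper names (`lookup`, `tokens`, `jaccard`, `group_similar`), and the 0.2 similarity threshold are all assumptions made up for the example. A lookup returns a value that is stored explicitly, while grouping has to discover which results belong together.

```python
# Hypothetical toy data, invented for illustration only.
phone_directory = {"Alice": "555-0101", "Bob": "555-0102"}
search_results = [
    "Amazon rainforest deforestation and wildlife",
    "Amazon.com online shopping deals",
    "Wildlife of the Amazon rainforest river basin",
    "Amazon.com shopping cart and orders",
]


def lookup(name):
    """Not data mining: return a value that is stored explicitly."""
    return phone_directory[name]


def tokens(text, query="amazon"):
    """Lowercase word set of a snippet, without the query term itself."""
    words = text.lower().replace(".", " ").split()
    return {w for w in words if w != query}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A and B| / |A or B|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def group_similar(snippets, threshold=0.2):
    """Data mining flavor: greedily group snippets that share vocabulary.

    Each snippet joins the first group whose first member is similar
    enough; otherwise it starts a new group. The threshold is an
    assumed value chosen for this toy data.
    """
    groups = []  # each group is a list of snippet indices
    for i, snippet in enumerate(snippets):
        for group in groups:
            representative = snippets[group[0]]
            if jaccard(tokens(snippet), tokens(representative)) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


groups = group_similar(search_results)
```

On this toy data the two rainforest snippets end up in one group and the two Amazon.com snippets in another, even though no group labels were supplied.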
Knowledge Discovery (KDD) Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process. Databases -> Data Cleaning and Data Integration -> Preprocessed Data -> Selection and Data transformations -> Task-relevant Data -> Data Mining -> Interpretation -> Knowledge]
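The stages named in the KDD figure (cleaning, integration, selection, transformation, mining, interpretation) can be read as a chain of functions, each consuming the output of the previous one. The sketch below assumes two made-up store databases and a deliberately trivial "mining" step (total quantity per item); every record, field name, and function name is hypothetical and only illustrates the order of the stages.

```python
from collections import Counter

# Hypothetical raw records from two "databases" (invented for illustration).
store_a = [
    {"customer": "c1", "item": "Milk", "qty": 2},
    {"customer": "c2", "item": None, "qty": 1},      # missing item: dirty
    {"customer": "c3", "item": "bread", "qty": 1},
]
store_b = [
    {"customer": "c4", "item": "milk ", "qty": 1},
    {"customer": "c5", "item": "Bread", "qty": -3},  # negative qty: dirty
]


def clean(records):
    """Data cleaning: drop records with missing items or impossible quantities."""
    return [r for r in records if r["item"] is not None and r["qty"] > 0]


def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    return [r for source in sources for r in source]


def select(records):
    """Selection: keep only the task-relevant attributes."""
    return [(r["item"], r["qty"]) for r in records]


def transform(pairs):
    """Transformation: normalize item names to a single spelling."""
    return [(item.strip().lower(), qty) for item, qty in pairs]


def mine(pairs):
    """Data mining (here a trivial pattern: total quantity per item)."""
    totals = Counter()
    for item, qty in pairs:
        totals[item] += qty
    return totals


def interpret(totals):
    """Interpretation: turn the pattern into a statement a person can use."""
    item, qty = totals.most_common(1)[0]
    return f"best seller: {item} ({qty} units)"


data = integrate(clean(store_a), clean(store_b))
knowledge = interpret(mine(transform(select(data))))
```

Note how the answer depends on the earlier stages: without cleaning and name normalization, "Milk", "milk " and the negative quantity would distort the totals.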
Steps of a KDD Process
Why Data Preprocessing?
Why Is Data Dirty?
Why Is Data Preprocessing Important?
Origins of Data Mining
Concerning:
• large amounts of data
• high dimensionality of data
• heterogeneous and complex nature of data
[Figure: origins of data mining; label: Statistics]
Data Mining Tasks
• Descriptive Tasks
– Goal: Find patterns in the data.
– Example: Which products are often bought together?
• Predictive Tasks
– Goal: Predict unknown values of a variable
• given observations (e.g., from the past)
• Machine Learning Terminology
– descriptive = unsupervised
– predictive = supervised
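A small sketch may help separate the two task types. The descriptive part counts which product pair occurs together most often; the predictive part guesses an unknown label from past labeled observations using a 1-nearest-neighbor rule. The baskets, the customer tuples (age, income in thousands), the labels, and the choice of 1-nearest-neighbor are all assumptions for illustration, not methods prescribed by the course.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (descriptive task: no labels involved).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "cheese"},
    {"cheese", "bread"},
    {"milk", "bread"},
]

# Hypothetical past observations: ((age, income in thousands), label).
past_customers = [
    ((25, 22), "buys"),
    ((28, 27), "buys"),
    ((52, 80), "does not buy"),
    ((47, 65), "does not buy"),
]


def most_common_pair(baskets):
    """Descriptive: find the product pair that is bought together most often."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, e.g. ("bread", "milk")
        counts.update(combinations(sorted(basket), 2))
    return counts.most_common(1)[0]


def predict_1nn(past, query):
    """Predictive: return the label of the closest past observation."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(past, key=lambda obs: squared_distance(obs[0], query))
    return label


pair, count = most_common_pair(baskets)
prediction = predict_1nn(past_customers, (30, 30))
```

The descriptive function needs only the data itself, while the predictive one needs past examples whose outcome is already known, which is the unsupervised/supervised distinction in the terminology above.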
Classic data mining tasks
• Classification:
mining patterns that can classify future (new) data into known
classes.
• Association rule mining:
mining any rule of the form X -> Y, where X and Y are sets of
data items. E.g.,
Cheese, Milk -> Bread [support=5%, confidence=80%]
age(X, "20..29") and income(X, "20k..29k") -> buys(X, "cd-player")
[support=2%, confidence=60%]
• Clustering:
identifying a set of similarity groups in the data
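The support and confidence figures attached to the association rules above follow the standard definitions: the support of X -> Y is the fraction of transactions that contain every item of X and Y, and the confidence is support(X and Y together) divided by support(X). The sketch below computes both on five made-up transactions; the data and function names are assumptions for illustration, so the numbers differ from the 5%/80% quoted in the slide.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"bread"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # rule never applies; avoid division by zero
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


rule_support = support(transactions, {"cheese", "milk", "bread"})
rule_confidence = confidence(transactions, {"cheese", "milk"}, {"bread"})
```

Here {Cheese, Milk, Bread} appears in 2 of 5 transactions (support 40%), and {Cheese, Milk} in 3 of 5, so the rule Cheese, Milk -> Bread has confidence 2/3.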
Classic data mining tasks (contd)
Data Mining Applications
Market analysis and management
• Target marketing
– Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations
between product sales, and predict
based on such associations
Data Mining Applications
Market analysis and management (2)
• Customer profiling
– data mining can identify what types of customers
buy what products (clustering or classification)
• Identify customer requirements
– identify the “best” products for different
customers
– use prediction techniques to find what factors will
attract new customers
Data Mining Applications
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week. Analyze patterns
that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti-terrorism
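The phone-call example ("analyze patterns that deviate from an expected norm") can be sketched with a simple z-score rule: model the norm as the mean call duration and flag calls that lie more than a chosen number of standard deviations away. The durations, the function name `unusual_calls`, and the threshold of 2.0 are assumptions for illustration; a real fraud model would also use destination and time of day, as the slide notes.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes for one subscriber.
call_durations = [3, 4, 5, 4, 3, 5, 4, 60]


def unusual_calls(durations, threshold=2.0):
    """Return indices of calls whose duration deviates strongly from the norm.

    The "norm" is the sample mean; deviation is measured in sample
    standard deviations (a z-score). The threshold is an assumed value.
    """
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls identical: nothing deviates
    return [
        i for i, d in enumerate(durations)
        if abs(d - mu) / sigma > threshold
    ]


flagged = unusual_calls(call_durations)
```

Only the 60-minute call is flagged here. Because an extreme value also inflates the mean and standard deviation, a robust variant based on the median may be preferable on real data.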
Are All the “Discovered” Patterns Interesting?
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Summary
• Data Mining is a process of extracting knowledge from data
• Data to be mined can be of any type
– Relational Databases, Advanced databases, etc.
• Knowledge to be discovered
– Frequent patterns, correlations, associations, classification, prediction,
clustering
• Data Mining is interdisciplinary
– Large amount of complex data and sophisticated applications
• Challenges of data mining
– Efficiency, scalability, parallel and distributed mining, handling high
dimensionality, handling noisy data, mining heterogeneous data, etc.
Where to Find References?
• Next: Association rule mining