
Cosc6211 Advanced Concepts in Data Mining: Weekend

This document provides an overview of an advanced concepts in data mining course. It outlines the instructor details, grading breakdown, data sources, teaching materials, and course outline. The course will cover topics like data preprocessing, association rule mining, classification, clustering, complex data mining and text mining.

Uploaded by

jemal yahyaa
Copyright
© All Rights Reserved

Cosc6211 Advanced Concepts in Data Mining
Weekend
General Information

• Instructor: A/fetah A.A


– Email: [email protected]
– Tel:
• Lecture time:
– 9:30am-11:00am, Saturday and Sunday
• Room:

Cosc6211 2
Grading: tentative

• Individual homework: 20-25%
• Project: 15-20%
• Test: 10% (can be adjusted if the class prefers)
• Final Exam: 50%
• Homework and project will be done with software:
– Weka, or another tool if you prefer one

Data sources
• Data sources from internet
· UCI KDD Archive
· UCI Machine Learning Repository

Where to Find the Set of Slides for the Text Book

• Tutorial sections (MS PowerPoint files):
– http://www.cs.sfu.ca/~han/dmbook
• Other conference presentation slides (.ppt):
– http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/~han
• Research papers, DBMiner system, and other related information:
– http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/~han

Teaching materials
Textbooks
• Jiawei Han and Micheline Kamber, “Data Mining:
Concepts and Techniques”.
References
• Pang-Ning Tan, Michael Steinbach, and Vipin
Kumar, "Introduction to Data Mining", Pearson
Addison Wesley, 2008, ISBN: 0-32-134136-7
• Margaret H. Dunham, Data Mining: Introductory
and Advanced Topics, Prentice Hall, 2003.
Course outline
1. Introduction, Data Preprocessing and
Association Rule Mining
2. Classification and Prediction
– Decision trees, Bayesian classifiers, rule-based classifiers,
ensembles, SVM, k-nearest neighbor, neural
networks and other classifiers
3. Clustering
4. Complex data mining and text mining

Outline
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining tasks
• Are all the patterns interesting?
• Major issues in data mining
• Association Rule Mining (if time allows)

1. Introduction and Data preprocessing
and Association rule mining
Why Data Mining?
• The Explosive Growth of Data: from terabytes
to petabytes (1 petabyte = 1 million gigabytes)
– Data collection and data availability
• Automated data collection tools, database systems,
Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific
simulation, …
• Society and everyone: news, digital cameras, YouTube

Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data
• Google has petabytes of web data
• Facebook has billions of active users
– purchases at department/ grocery stores, e-commerce
• Amazon handles millions of visits/day
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)

Why Data Mining? Scientific Viewpoint
• Data collected and stored at enormous speeds
– remote sensors on a satellite
• NASA EOSDIS archives petabytes of earth science data per year
– telescopes scanning the skies
• Sky survey data
– scientific simulations
• terabytes of data generated in a few hours
• Data mining helps scientists
– in automated analysis of massive datasets
– in hypothesis formation

Why data mining?

• Make use of your data assets
• There is a big gap from stored data to knowledge;
and the transition won’t occur automatically.
• Many interesting things that one wants to find
cannot be found using database queries
– “Find people likely to buy my products.”
– “Who is likely to respond to my promotion?”
– “Which movies should be recommended to each
customer?”

What Is Data Mining?

• Data mining is also called knowledge discovery in
databases (KDD)
• Data mining is
– extraction of useful patterns from data sources, e.g.,
databases, texts, web, images, etc.
– Patterns must be:
• valid, novel, potentially useful, understandable
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence.

What is (not) Data Mining?
• Is not data mining:
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
• Is data mining:
– Group together similar documents returned by search engine
according to their context (e.g. Amazon rainforest,
Amazon.com)
• Identify the following:
– Sales analysis
• What are the sales by quarter and region?
• How do sales compare in two different stores in the same state?

Knowledge Discovery (KDD) Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD process as a pipeline, from source to result: Databases, Data Cleaning and Data Integration, Preprocessed Data, Selection and Data Transformations, Task-relevant Data, Data Mining, Interpretation, Knowledge]

Steps of a KDD Process

• Learning the application domain
– relevant prior knowledge and goals of application
• Data cleaning: missing values, noisy data, and inconsistent data
• Data integration: merging data from multiple data stores
• Data selection: select the data relevant to the analysis
• Data transformation: aggregation (daily sales to weekly or monthly sales)
or generalisation (street to city; age to young, middle age and senior)
• Data mining: apply intelligent methods to extract patterns
• Pattern evaluation: interesting patterns should contradict the user’s
belief or confirm a hypothesis the user wished to validate
• Knowledge presentation: visualization and representation techniques to
present the mined knowledge to the users
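
The data transformation step above can be illustrated with a minimal Python sketch. It shows the two examples from the slide: aggregating daily sales to weekly sales, and generalising a numeric age to young, middle age or senior. The function names, the sample figures and the age cut-offs (30 and 60) are assumptions made for illustration; the slides do not define them.

```python
from collections import defaultdict
from datetime import date


def weekly_sales(daily):
    """Aggregate (date, amount) pairs into totals per ISO (year, week)."""
    totals = defaultdict(float)
    for day, amount in daily:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        totals[(iso[0], iso[1])] += amount
    return dict(totals)


def age_group(age):
    """Generalise a numeric age to a category (cut-offs are illustrative)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle age"
    return "senior"


# Hypothetical sample data, not taken from the course material.
daily = [
    (date(2024, 1, 1), 100.0),  # Monday, ISO week 1
    (date(2024, 1, 3), 50.0),   # Wednesday, ISO week 1
    (date(2024, 1, 8), 70.0),   # Monday, ISO week 2
]

print(weekly_sales(daily))                    # {(2024, 1): 150.0, (2024, 2): 70.0}
print([age_group(a) for a in (22, 45, 71)])   # ['young', 'middle age', 'senior']
```

Both functions trade detail for a higher level of abstraction, which is usually what makes patterns visible to the mining step that follows.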

Why Data Preprocessing?

• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthdate=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
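
The three kinds of dirty data listed above can be detected with simple checks. The sketch below is one possible way to do it in Python, using the slide's own examples (blank occupation, salary of -10, age 42 with a 1997 birthdate). The record layout, the function name `find_problems`, the reference date, and the reading of "03/07/1997" as 7 March 1997 are all assumptions made for illustration.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []

    # Incomplete: the attribute value is missing or blank.
    if not str(record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is missing")

    # Noisy: the value is outside any plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")

    # Inconsistent: the stored age disagrees with the age derived from birthdate.
    born, age = record.get("birthdate"), record.get("age")
    if born is not None and age is not None:
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        derived = today.year - born.year - (0 if had_birthday else 1)
        if derived != age:
            problems.append("inconsistent: age does not match birthdate")

    return problems


today = date(2024, 1, 1)  # fixed reference date so the example is repeatable
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthdate": date(1997, 3, 7)}
clean = {"occupation": "engineer", "salary": 5000, "age": 26, "birthdate": date(1997, 3, 7)}

for problem in find_problems(dirty, today):
    print(problem)
print(find_problems(clean, today))  # []
```

Checks like these only flag suspicious records; deciding whether to correct, fill in or drop them remains part of data cleaning.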

Why Is Data Dirty?

• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and
when it is analyzed.
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
– Data warehouse needs consistent integration of
quality data
• Data extraction, cleaning, and transformation
make up the majority of the work of building
a data warehouse
Data Mining: on what kinds of data
In principle, data mining should be applicable to any data repository
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Object-relational databases
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Spatial data and spatiotemporal data
– Text databases and Multimedia databases
– The World-Wide Web
– Heterogeneous databases …

Origins of Data Mining

• Data mining combines ideas from statistics,
machine learning, artificial intelligence, and
database systems
– Tries to overcome shortcomings of
traditional techniques concerning
• large amount of data
• high dimensionality of data
• heterogeneous and complex nature of data
[Figure: data mining at the intersection of artificial intelligence, database systems, pattern recognition, and statistics]

Data Mining Tasks
• Descriptive Tasks
– Goal: Find patterns in the data.
– Example: Which products are often bought together?
• Predictive Tasks
– Goal: Predict unknown values of a variable
• given observations (e.g., from the past)
• Machine Learning Terminology
– descriptive = unsupervised
– predictive = supervised

Classic data mining tasks

• Classification:
mining patterns that can classify future (new) data into known
classes.
• Association rule mining
mining any rule of the form X -> Y, where X and Y are sets of
data items. E.g.,
Cheese, Milk -> Bread [support=5%, confidence=80%]
Age(X, "20..29") and income(X, "20k..29k") -> buys(X, "cd-player")
[support=2%, confidence=60%]
• Clustering
identifying a set of similarity groups in the data
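
The support and confidence figures attached to the association rules above can be computed directly from a list of transactions. The following Python sketch uses the usual definitions: support is the fraction of transactions containing all items of the rule, and confidence is the support of the whole rule divided by the support of its left-hand side. The four baskets are made-up data, so the numbers differ from the 5% and 80% on the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies; report 0 rather than divide by zero
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical market baskets, for illustration only.
baskets = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"bread"},
    {"milk", "bread"},
]

print(support(baskets, {"cheese", "milk", "bread"}))       # 0.25
print(confidence(baskets, {"cheese", "milk"}, {"bread"}))  # 0.5
```

Here the rule Cheese, Milk -> Bread holds in one of four baskets (support 25%) and in one of the two baskets that contain both cheese and milk (confidence 50%).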

Classic data mining tasks (contd)

• Sequential pattern mining:
– A sequential rule: A -> B, says that event A will be
immediately followed by event B with a certain
confidence
• Deviation, outlier, and novelty detection:
– Discovering the most significant changes in data
• Data visualization
– using graphic methods to show patterns in data.

Data Mining Applications
Market analysis and management
• Target marketing
– Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations
between product sales, and predict
based on such associations
Data Mining Applications
Market analysis and management (2)
• Customer profiling
– data mining can identify what types of customers
buy what products (clustering or classification)
• Identify customer requirements
– identify the “best” products for different
customers
– use prediction techniques to find what factors will
attract new customers
Data Mining Applications
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week. Analyze patterns
that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti-terrorism
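
The idea of analysing patterns that deviate from an expected norm, as in the phone-call model above, can be sketched with a simple z-score test. This is only one possible technique and is not prescribed by the slides: the norm is estimated from past call durations, and a new call is flagged when it lies more than a chosen number of standard deviations from the mean. The function name, the threshold of 3 and the sample durations are assumptions.

```python
from statistics import mean, stdev


def flag_deviations(history, new_values, threshold=3.0):
    """Return the new values that lie far from the norm estimated from history."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        # No variation in the history: anything different from the mean deviates.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes, for illustration only.
history = [3, 4, 5, 4, 3, 5, 4, 4]
new_calls = [4, 5, 120]

print(flag_deviations(history, new_calls))  # [120]
```

A real fraud model would combine several attributes (destination, duration, time of day or week), but the principle of comparing new behaviour against a learned norm is the same.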

Are All the “Discovered” Patterns Interesting?

• Data mining may generate thousands of patterns: not all of
them are interesting
– Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures
– A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.

Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy

Summary
• Data Mining is a process of extracting knowledge from data
• Data to be mined can be of any type
– Relational Databases, Advanced databases, etc.
• Knowledge to be discovered
– Frequent patterns, correlations, associations, classification, prediction,
clustering
• Data Mining is interdisciplinary
– Large amount of complex data and sophisticated applications
• Challenges of data Mining
– Efficiency, scalability, parallel and distributed mining, handling high
dimensionality, handling noisy data, mining heterogeneous data, etc.

Where to Find References?

• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• Data mining and KDD
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journal: Data Mining and Knowledge Discovery, KDD Explorations
• Database systems
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
• AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
– Journals: Machine Learning, Artificial Intelligence, etc.
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of statistics, etc.
• Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. visualization and computer graphics, etc.

• Next: Association rule mining
