Lecture 10 - Data Mining in Practice

Web mining involves analyzing data from the web to discover useful information. There are three types of web mining: web content mining extracts information from web pages, web structure mining analyzes link structures between pages, and web usage mining examines user behavior data from web logs. Common applications of web mining include search engines, recommendation systems, and market analysis.

Uploaded by

johndeuterok

Lecture 10

Data Mining in Practice

Overview

• Data mining life cycle
• Data mining applications
Data Mining Life Cycle
Cross Industry Standard Process for Data Mining
(CRISP-DM) (appeared in the late 1990s)

https://www.sv-europe.com/crisp-dm-methodology/
Business Understanding

• Determine business objectives
  • Background of business
  • Business objectives
  • Business success criteria
• Assess business situation
  • What are the available resources?
• Determine data mining goals
  • Success criteria
• Produce project plan
Data Understanding

• Collect initial data
  • From various sources
• Describe data
  • Metadata
• Explore data
  • Gain an initial understanding of the data
• Verify data quality
  • Check attribute values, e.g. missing values
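The data quality check can be sketched in a few lines. The following is only an illustration: the records, attribute names and the use of None to mark a missing value are assumptions made for this example, not part of the lecture.

```python
# Hypothetical records; None marks a missing attribute value (an assumption).
records = [
    {"age": 34, "income": 52000, "city": "Leeds"},
    {"age": None, "income": 61000, "city": "York"},
    {"age": 45, "income": None, "city": None},
    {"age": 29, "income": 48000, "city": "Leeds"},
]


def missing_counts(rows):
    """Count how many rows have a missing (None) value for each attribute."""
    counts = {}
    for row in rows:
        for attr, value in row.items():
            counts.setdefault(attr, 0)
            if value is None:
                counts[attr] += 1
    return counts


report = missing_counts(records)
for attr, n in report.items():
    print(f"{attr}: {n} missing of {len(records)}")
```

A report like this helps decide, in the data preparation phase, which attributes need cleaning and which may have to be excluded.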
Data Preparation

• Select data
  • Rationale for inclusion/exclusion of data
• Clean data
  • Replace missing values, normalization, corrections, etc.
• Construct data
  • Deriving new values
• Integrate data
  • Merge data (if necessary)
• Format data
  • Preparing data to be read by data mining techniques
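Two of the cleaning steps named above, replacing missing values and normalization, might look as follows. This is a minimal sketch: mean imputation and min-max scaling are just one common choice each, and the income values are made up for illustration.

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [40000, None, 60000, 80000]  # hypothetical attribute with a gap
filled = impute_mean(incomes)           # the gap becomes the mean, 60000.0
scaled = min_max(filled)
print(filled)
print(scaled)
```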
Modeling

• Data mining
• Select modeling technique (algorithms)
• Generate test design, for example in classification the dataset is divided into a training set and a test set
• Build model
• Assess model
  • Revise parameters
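The test design mentioned above, dividing a dataset into a training set and a test set, can be sketched as below. The 70/30 ratio, the fixed seed and the toy loan records are assumptions for illustration only.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled examples: (record id, class label).
dataset = [(i, "approved" if i % 2 == 0 else "rejected") for i in range(10)]
train, test = train_test_split(dataset)
print(len(train), "training rows,", len(test), "test rows")
```

The model is built on the training rows only; the held-out test rows are then used to assess it.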
Evaluation

• Evaluate results
  • Check how the model performed
  • Must align with business objectives
• Approved models
  • Review the models before they are endorsed by experts
  • Must find support
• Determine next steps
  • Deployment (indicating successful deployment of project)
  • Or review business objectives (go through another round of data mining or start a completely new project with different business objectives)
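Checking how a model performed usually starts from a few simple measures on the test set. The sketch below computes accuracy, precision and recall for a two-class problem; the "approved"/"rejected" labels and the prediction lists are hypothetical and chosen only for illustration.

```python
def evaluate(actual, predicted, positive="approved"):
    """Return accuracy, precision and recall for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test-set labels and model predictions.
actual = ["approved", "approved", "rejected", "rejected",
          "approved", "rejected", "approved", "rejected"]
predicted = ["approved", "rejected", "rejected", "approved",
             "approved", "rejected", "approved", "rejected"]
scores = evaluate(actual, predicted)
print(scores)
```

Whether such scores are good enough is a business decision, which is why the evaluation phase ties the results back to the business objectives.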
Deployment

• Applied to business practice

• Monitoring and maintenance
• Expecting positive outcome
• Produce final report
• Meet business objectives?
Building a Loan Approval Model
[Figure: loan approval model. Output: approved or rejected. Annotations: poor performance; perform like human?]
Data mining applications

1. Text mining
2. Web mining
Text Mining

• Text mining = text data mining
• Text mining is the application of data mining to unstructured or less structured text files. It entails generating meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms.
[Figure: text mining and data mining at the intersection of artificial intelligence, pattern recognition, statistics, machine learning, mathematical modeling, databases, and management science & information systems]
Text and text pre-processing

• A collection of text is called a corpus
• Components: words
  • Many non-significant words: “the”, “a”, etc.
  • The same word occurs in different forms: “study”, “studying”, “studies”, etc.
• Methods to select significant words: entropy, TF-IDF, etc.
• All words are presented in a document-word matrix
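As an illustration of how TF-IDF singles out significant words, the sketch below uses one common variant (term frequency divided by document length, times the natural log of N over document frequency). Other variants exist, and the three tiny documents are invented for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (a list of word lists)."""
    n = len(docs)
    df = Counter()                 # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical mini-corpus, already split into words.
docs = [
    "the database stores the data".split(),
    "the network cable is broken".split(),
    "the text model reads the text".split(),
]
weights = tf_idf(docs)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

A word such as "the" that occurs in every document receives weight zero, while a word concentrated in one document receives a high weight.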
Text Mining
Document-word matrix

DocID  database  cable  network  broadband  text  model  keyboard
Doc1   1         1      1        1          0     0      0
Doc2   0         0      1        0          0     0      1
Doc3   1         0      0        0          0     1      0
..     ..        ..     ..       ..         ..    ..     ..
DocN   ..        ..     ..       ..         ..    ..     ..
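A binary document-word matrix like the one above can be built as follows. The keyword lists per document are assumptions, chosen so that the result reproduces the Doc1 to Doc3 rows of the table.

```python
# Column order taken from the table header.
vocabulary = ["database", "cable", "network", "broadband",
              "text", "model", "keyboard"]

# Hypothetical keyword lists, chosen to match the first three table rows.
documents = {
    "Doc1": ["database", "cable", "network", "broadband"],
    "Doc2": ["network", "keyboard"],
    "Doc3": ["database", "model"],
}


def document_word_matrix(docs, vocab):
    """Map each document id to a 0/1 row: 1 if the word occurs in it."""
    matrix = {}
    for doc_id, words in docs.items():
        present = set(words)
        matrix[doc_id] = [1 if term in present else 0 for term in vocab]
    return matrix


matrix = document_word_matrix(documents, vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, row)
```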
Text Mining

• Text mining works with text documents.
• It extracts the documents' features and uses qualitative analysis.
• Text mining helps organizations:
  • Find the “hidden” content of documents, including additional useful relationships
  • Relate documents across previously unnoticed divisions
  • Group documents by common themes
Text Mining

• Applications of text mining
  • Automatic detection of e-mail spam or phishing through analysis of the document content
  • Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
  • Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
  • In bioinformatics, find proteins which have many different names
Text Mining

• Applications of text mining
  • Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
  • Creation of a “relationship view” of a document collection
  • Qualitative analysis of documents to detect deception
  • Tweet analysis
  • Extracting knowledge from Facebook
Text Mining – Text pre-processing

How to mine text

1. Extract words from the text documents.
2. Eliminate commonly used words (stop-words), e.g. the, a, in, on, about.
3. Replace words with their stems or roots (stemming algorithms), e.g. program, programs, programming all become “program”.
4. Consider (unifying) synonyms and phrases to reduce the number of words.
5. Calculate the weights/frequencies of the remaining terms.
6. Store them in a document-word matrix, which is the basis for further text mining.
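Steps 1 to 5 can be sketched for a single document as below. This is a toy illustration: the stemmer is a crude suffix stripper written for this example (a real system would use an established algorithm such as the Porter stemmer), and the synonym table and sample sentence are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "in", "on", "about"}  # step 2: list from the slide
SYNONYMS = {"app": "program"}                   # step 4: hypothetical mapping


def stem(word):
    """Step 3: crude suffix stripping, for illustration only."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:  # programm -> program
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word


def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())        # step 1: extract words
    words = [w for w in words if w not in STOP_WORDS]  # step 2: stop-words
    words = [stem(w) for w in words]                   # step 3: stemming
    words = [SYNONYMS.get(w, w) for w in words]        # step 4: synonyms
    return Counter(words)                              # step 5: frequencies


text = ("The programs in the programming course are about a program. "
        "The app is on a disk.")
counts = preprocess(text)
print(counts.most_common())
```

Running this over every document and lining the counts up against a shared vocabulary gives the document-word matrix of step 6.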
Text Mining – Association Analysis

• Association analysis
  • Each document is a “transaction” and its list of keywords is the “list of items”.
  • A collection of documents forms the “transaction database”.
  • Example of an association rule: {data, mining} → {clustering, Naïve, Bayes}
  • The problem with this kind of keyword association discovery is that many association patterns may be discovered. Thus, some associations may be shallow in meaning and indicate only co-occurrences.
  • Frequently occurring keywords may serve the purpose of phrase extraction, e.g. {human} → {computer, interaction}
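Treating documents as transactions, the usual support and confidence measures for a rule can be computed as in the sketch below. The five keyword sets are invented for illustration.

```python
# Hypothetical transaction database: one keyword set per document.
transactions = [
    {"data", "mining", "clustering"},
    {"data", "mining", "classification"},
    {"data", "mining", "clustering", "bayes"},
    {"human", "computer", "interaction"},
    {"data", "warehouse"},
]


def support(itemset, db):
    """Fraction of documents that contain every keyword in the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(lhs, rhs, db):
    """Of the documents containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs, db) / support(lhs, db)


rule_support = support({"data", "mining", "clustering"}, transactions)
rule_confidence = confidence({"data", "mining"}, {"clustering"}, transactions)
phrase_confidence = confidence({"human"}, {"computer", "interaction"},
                               transactions)
print(rule_support, rule_confidence, phrase_confidence)
```

A rule such as {human} → {computer, interaction} with very high confidence is the kind of pattern that hints at a phrase rather than a deeper relationship.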
Text Mining - Classification

• Text categorization/document classification
  • Classifying a given document into one of several predefined classes.
  • Attributes = keywords
  • Used to classify an unseen document
• Examples of applications:
  • Use classification models to determine whether incoming emails are junk emails.
  • Search web page contents and do automatic categorization for effective indexing.
  • A personal organizer on a desktop PC can use the classification model to organize files on the local disk.
Text Mining - Classification

• Documents contain a lot of keywords.
• Problems:
  • Too many attributes exist
  • Inefficient for decision tree induction
  • The tree is likely to be large and complex and to suffer from overfitting, causing poor performance.
• Naïve Bayes performs fairly well
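To make the Naïve Bayes option concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The training e-mails and the "junk"/"ok" labels are invented for illustration, and this is one standard formulation rather than the specific method used in the lecture.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (list_of_words, label). Returns the counts needed."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, model):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training e-mails, already reduced to keywords.
training = [
    (["win", "money", "now"], "junk"),
    (["free", "money", "offer"], "junk"),
    (["meeting", "schedule", "today"], "ok"),
    (["project", "report", "today"], "ok"),
]
model = train(training)
print(classify(["free", "money"], model))
print(classify(["project", "meeting"], model))
```

Because each keyword contributes an independent term to the score, the cost grows only linearly with the number of attributes, which is one reason the method copes with many keywords.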
Text Mining - Clustering

• Group documents containing similar words (partitioning documents into groups according to similar keywords contained in the documents).
• High dimensionality problem
  • Data objects are diverse and distances between objects are nearly uniform.
  • Therefore, it is difficult to separate objects because similarity within groups is more or less the same as similarity between groups.
• Dimension reduction techniques can be used to reduce dimensions, especially for text documents, e.g. latent semantic indexing (LSI), probabilistic LSI (PLSI) and locality preserving indexing (LPI).
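The claim that distances become nearly uniform in high dimensions can be checked with a small experiment. The sketch below draws random points in the unit cube (an assumption standing in for real document vectors) and compares the farthest and nearest distance from a query point.

```python
import math
import random


def distance_contrast(dimensions, n_points=50, seed=0):
    """Ratio of the farthest to the nearest distance from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return max(distances) / min(distances)


for d in (2, 10, 100, 1000):
    print(d, "dimensions: max/min distance ratio =",
          round(distance_contrast(d), 2))
```

As the number of dimensions grows, the ratio approaches 1, so the nearest and farthest objects become hard to tell apart. This is the motivation for reducing dimensions before clustering.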
Text Mining Demo

• http://books.google.com/ngrams
Web Mining

• Web mining is the discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools.
• Documents on the web tend to be:
  • Semantically rich in their content.
  • Multimedia: they contain text on a specific subject as well as advertisements, flash animations and images. (This makes information extraction more difficult than in text mining.)
  • Semi-structured: markup tags indicate a certain structure, but the text within the paragraphs is still unstructured.
  • Dynamic: new pages are added constantly, and if a database is used, the content of a web page can also be dynamic.
Web Mining Types
• Web content mining: the extraction of useful information from Web pages
• Web structure mining: the development of useful information from the links included in Web documents
• Web usage mining: the extraction of useful information from the data generated through webpage visits, transactions, etc.
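As a small illustration of web usage mining, the sketch below counts page visits and page-to-page transitions from a simplified visit log. The (user, page) record format and the sample data are assumptions; real web server logs carry more fields and need session handling.

```python
from collections import Counter

# Hypothetical visit log in time order: (user id, page visited).
visits = [
    ("u1", "/home"), ("u1", "/products"), ("u2", "/home"),
    ("u2", "/cart"), ("u1", "/cart"), ("u3", "/home"),
]

# How often each page was visited.
page_counts = Counter(page for _, page in visits)

# Rebuild each user's click path, then count consecutive page pairs.
paths = {}
for user, page in visits:
    paths.setdefault(user, []).append(page)

transitions = Counter()
for pages in paths.values():
    transitions.update(zip(pages, pages[1:]))

print(page_counts.most_common())
print(transitions.most_common())
```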
Web Content
Web Structure - PageRank
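The slides name PageRank without giving details, so the following is only a sketch of the commonly described power-iteration form. The damping factor of 0.85, the iteration count and the four-page link graph are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link structure between four pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

In this toy graph page C, which receives links from three pages, ends up with the highest rank, and page D, which nobody links to, ends up with the lowest.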
Web Mining

• Uses for web mining:
  • Target electronic advertisements and coupons at user groups.
  • Predict user behavior, e.g. online purchasing patterns, and monitor customers’ online shopping behavior.
  • Assist website designers in discovering design flaws and improving website structure.
References

• Chapter 10 of Data Mining Techniques and Applications
