Lecture 10 - Data Mining in Practice

Web mining involves analyzing data from the web to discover useful information. There are three types of web mining: web content mining extracts information from web pages, web structure mining analyzes link structures between pages, and web usage mining examines user behavior data from web logs. Common applications of web mining include search engines, recommendation systems, and market analysis.


Lecture 10

Data Mining in Practice


Overview

• Data mining life cycle
• Data mining applications
Data Mining Life Cycle
Cross Industry Standard Process for Data Mining
(CRISP-DM) (conceived in 1996)

https://www.sv-europe.com/crisp-dm-methodology/
Business Understanding

• Determine business objectives
  • Background of the business
  • Business objectives
  • Business success criteria
• Assess the business situation
  • Available resources
• Determine data mining goals
  • Success criteria
• Produce project plan
Data Understanding

• Collect initial data
  • From various sources
• Describe data
  • Metadata
• Explore data
  • Gain an initial understanding of the data
• Verify data quality
  • Check attribute values, e.g. missing values
Data Preparation

• Select data
  • Rationale for inclusion/exclusion of data
• Clean data
  • Replace missing values, normalize, apply corrections, etc.
• Construct data
  • Derive new values
• Integrate data
  • Merge data (if necessary)
• Format data
  • Prepare the data to be read by data mining techniques
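The "Clean data" step can be illustrated with a minimal sketch. It assumes mean imputation for missing values and min-max scaling for normalization; the slides name neither technique, and the `incomes` attribute is hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one missing value.
incomes = [30000, None, 50000, 70000]
cleaned = fill_missing(incomes)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```

Other choices (median imputation, z-score standardization) fit the same step; which one is appropriate depends on the data and the modeling technique.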
Modeling

• Data mining
• Select modeling technique (algorithms)
• Generate test design: for example, in classification the dataset is
divided into a training set and a test set
• Build model
• Assess model
• Revise parameters
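A test design for classification can be sketched as a random split into training and test sets. The 70/30 ratio, the fixed seed, and the function name `train_test_split` are assumptions made for illustration, not values given in the lecture.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle records reproducibly and split them into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-in for ten labelled examples
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built on `train` only; `test` is held back so that the "Assess model" step measures performance on unseen data.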
Evaluation

• Evaluate results
  • Check how the model performed
  • Results must align with the business objectives
• Approve models
  • Experts review the models before endorsing them
  • Models must find support
• Determine next steps
  • Deployment (indicating a successful project)
  • Or review the business objectives (go through another round of data
    mining or start a completely new project with different business
    objectives)
Deployment

• Applied to business practice


• Monitoring and maintenance
• Expecting positive outcome
• Produce final report
• Meet business objectives?
Building a Loan Approval Model

[Figure: loan approval model. Labels: approved or rejected; poor performance; perform like human?]
Data mining applications

1. Text mining
2. Web mining
Text Mining

• Text mining = text data mining
• Text mining is the application of data mining to unstructured or
semi-structured text files. It entails generating meaningful
numerical indices from the unstructured text and then
processing these indices with various data mining
algorithms.
[Figure: data mining at the intersection of several disciplines: statistics, artificial intelligence, pattern recognition, machine learning, mathematical modeling, databases, and management science & information systems]
Text and text pre-processing

• A collection of text is called a corpus
• Its components are WORDS
  • Many words are non-significant: “the”, “a”, etc.
  • The same word occurs in different forms: “study”, “studying”,
    “student”, etc.
• Methods to select significant words: entropy, TF-IDF, etc.
• All words are represented in a document-word matrix
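TF-IDF, one of the word-selection methods named above, can be sketched as follows. The weighting used here, raw term count times log(N / document frequency), is one common variant and is an assumption; the slides do not give a formula. The three toy documents are hypothetical.

```python
import math

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf(t, d) * log(N / df(t)), with tf the raw count.
    """
    n_docs = len(docs)
    df = {}  # number of documents containing each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

docs = [
    ["the", "database", "network"],
    ["the", "keyboard"],
    ["the", "database", "model"],
]
weights = tf_idf(docs)
print(weights[0])
```

A word such as "the" that occurs in every document gets weight 0, while a word confined to one document gets the highest weight, which is why TF-IDF helps separate significant words from non-significant ones.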
Text Mining
Document-word matrix

DocID  database  cable  network  broadband  text  model  keyboard
Doc1   1         1      1        1          0     0      0
Doc2   0         0      1        0          0     0      1
Doc3   1         0      0        0          0     1      0
...    ...       ...    ...      ...        ...   ...    ...
DocN   ..        ..     ..       ..         ..    ..     ..
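A binary document-word matrix like the one above can be built with a few lines of code. In this sketch the vocabulary is taken from the table header, while the word lists for Doc1 to Doc3 are hypothetical and chosen only so that they reproduce the three rows shown.

```python
vocabulary = ["database", "cable", "network", "broadband", "text", "model", "keyboard"]

def to_row(words, vocabulary):
    """Binary row: 1 if the vocabulary term occurs in the document, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical document contents (already reduced to keywords).
docs = {
    "Doc1": "database cable network broadband".split(),
    "Doc2": "network keyboard".split(),
    "Doc3": "database model".split(),
}
matrix = {doc_id: to_row(words, vocabulary) for doc_id, words in docs.items()}
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Replacing the 0/1 entries with term counts or weights gives the frequency-based variants of the same matrix.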
Text Mining

• Text mining works with text documents.
• It extracts the documents' features and uses qualitative analysis.
• Text mining helps organizations:
  • Find the “hidden” content of documents, including
    additional useful relationships
  • Relate documents across previously unnoticed divisions
  • Group documents by common themes
Text Mining

• Applications of text mining


• Automatic detection of e-mail spam or phishing through
analysis of the document content
• Automatic processing of messages or e-mails to route a
message to the most appropriate party to process that
message
• Analysis of warranty claims, help desk calls/reports, and
so on to identify the most common problems and
relevant responses
• In bioinformatics, find proteins which have many
different names
Text Mining

• Applications of text mining


• Analysis of related scientific publications in
journals to create an automated summary view
of a particular discipline
• Creation of a “relationship view” of a document
collection
• Qualitative analysis of documents to detect
deception
• Tweet Analysis
• Extracting Knowledge from Facebook
Text Mining – Text pre-processing

How to mine text


1. Extract the words from the text documents.
2. Eliminate commonly used words (stop-words), e.g. the, a, in, on,
about.
3. Replace words with their stems or roots (stemming algorithms),
e.g. program, programs, programming all become “program”.
4. Consider unifying synonyms and phrases to reduce the number of
words.
5. Calculate the weights/frequencies of the remaining terms.
6. Store them in a document-word matrix, which is the basis for further
text mining.
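Steps 1, 2, 3 and 5 above can be sketched in a few lines. The stop-word list contains only the slide's examples, and `stem` is a deliberately crude, hypothetical suffix stripper rather than a real stemming algorithm such as Porter's; synonym unification (step 4) is omitted.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "in", "on", "about"}  # the slide's examples only

def stem(word):
    """Crude illustrative stemmer: strip 'ing' or 's', then undouble the ending."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "programm" -> "program": drop a doubled final consonant.
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def preprocess(text):
    """Extract words, drop stop-words, stem, and count term frequencies."""
    words = re.findall(r"[a-z]+", text.lower())           # step 1
    words = [w for w in words if w not in STOP_WORDS]     # step 2
    stems = [stem(w) for w in words]                      # step 3
    return Counter(stems)                                 # step 5

counts = preprocess("A program about programming in the programs on networks")
print(counts)
```

The resulting counts for each document form one row of the document-word matrix in step 6.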
Text Mining – Association Analysis

• Association analysis
  • Each document is a “transaction” and its list of keywords is the
    “list of items”.
  • A collection of documents forms the “transaction database”.
  • Example of an association rule: {data, mining} → {clustering, Naïve,
    Bayes}
  • The problem with this kind of keyword association discovery is that
    many association patterns may be discovered. Some of these
    associations may be shallow in meaning and indicate only
    co-occurrences.
  • Frequently co-occurring keywords may serve the purpose of phrase
    extraction, e.g. {human} → {computer, interaction}
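A keyword rule such as {human} → {computer, interaction} is usually scored by support and confidence. The sketch below assumes those two standard measures (the slides do not name them) and uses four hypothetical keyword sets as the transaction database.

```python
def support(itemset, transactions):
    """Fraction of transactions (documents) containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Hypothetical keyword sets, one per document.
transactions = [
    {"human", "computer", "interaction"},
    {"human", "computer", "interaction", "design"},
    {"human", "resources"},
    {"data", "mining", "clustering"},
]
s = support({"human", "computer", "interaction"}, transactions)
c = confidence({"human"}, {"computer", "interaction"}, transactions)
print(s, c)
```

Raising the minimum support and confidence thresholds is one way to cut down the number of shallow co-occurrence patterns.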
Text Mining - Classification

• Text categorization/document classification


• Classifying a given document into one of several predefined classes.
• Attributes = keywords
• Used to classify an unseen document
• Example of applications:
• Use classification models to determine whether incoming
emails are junk emails.
• Search web page contents and do automatic categorization
for effective indexing.
• Personal organizer on a desktop PC can use the classification
model to organize files on the local disk.
Text Mining - Classification

• Documents contain a lot of keywords.
• Problems:
  • Too many attributes exist
  • Decision tree induction is inefficient
  • The tree is likely to be large and complex and to suffer from
    overfitting, causing poor performance.
• Naïve Bayes performs fairly well
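To make the Naïve Bayes point concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The smoothing choice, the function names, and the four training e-mails are assumptions for illustration; the lecture does not specify a variant.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (list_of_words, label) pairs."""
    class_counts = Counter()            # documents per class
    word_counts = defaultdict(Counter)  # word occurrences per class
    vocabulary = set()
    for words, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(words, model):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training e-mails.
training = [
    ("win cash prize now".split(), "junk"),
    ("cheap prize offer".split(), "junk"),
    ("meeting agenda attached".split(), "ok"),
    ("project meeting tomorrow".split(), "ok"),
]
model = train_naive_bayes(training)
print(classify("cash prize offer".split(), model))
print(classify("agenda for project meeting".split(), model))
```

Each keyword contributes one independent term to the score, so the cost grows only linearly with the number of attributes. That is one reason the method copes with large vocabularies better than decision tree induction.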
Text Mining - Clustering

• Group documents containing similar words (partition
documents into groups according to the similar keywords
they contain).
• High dimensionality problem
  • Data objects are diverse and the distances between objects become
    nearly uniform.
  • It is therefore difficult to separate objects, because
    similarity within groups is more or less the same as
    similarity between groups.
• Dimension reduction techniques can be used to reduce the number of
dimensions, especially for text documents, e.g. latent
semantic indexing (LSI), probabilistic LSI (PLSI) and locality
preserving indexing (LPI).
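Grouping documents by similar keywords requires a similarity measure between rows of the document-word matrix. The sketch below assumes cosine similarity, a common choice for text that the slides do not name, and applies it to the Doc1 to Doc3 rows of the document-word matrix shown earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Rows from the document-word matrix (database, cable, network,
# broadband, text, model, keyboard).
doc1 = [1, 1, 1, 1, 0, 0, 0]
doc2 = [0, 0, 1, 0, 0, 0, 1]
doc3 = [1, 0, 0, 0, 0, 1, 0]

sim12 = cosine_similarity(doc1, doc2)
sim13 = cosine_similarity(doc1, doc3)
sim23 = cosine_similarity(doc2, doc3)
print(sim12, sim13, sim23)
```

With thousands of keyword columns, most pairs of documents share few terms and their similarities bunch together, which is the high dimensionality problem described above.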
Text Mining Demo

• http://books.google.com/ngrams
Web Mining

• Web mining is the discovery and analysis of interesting and
useful information from the Web, about the Web, and usually
through Web-based tools.
• Documents on the web tend to be
  • Semantically rich in content.
  • Multimedia: they contain text on a specific subject and also
    advertisements, flash animations and images. (This makes
    information extraction more difficult than in text
    mining.)
  • Semi-structured: mark-up tags indicate a certain structure, but
    the text within the paragraphs is still unstructured.
  • Dynamic: new pages are added constantly and, if a database is
    used, the content of a web page can itself be dynamic.
Web Mining Types
Web Mining

• Web content mining: the extraction of useful information from
Web pages
• Web structure mining: the development of useful information
from the links included in Web documents
• Web usage mining: the extraction of useful information from
the data generated through webpage visits, transactions, etc.
Web Content
Web Structure - PageRank
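PageRank, named above as an example of web structure mining, can be sketched with a simple power iteration over the link graph. The damping factor of 0.85, the fixed 50 iterations, and the four-page graph are assumptions for illustration; every linked page must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (the random-jump term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link structure: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

Page C, which three pages link to, ends up with the highest rank, and page D, which nothing links to, with the lowest. The ranking comes from the link structure alone, not from page content.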
Web Mining

• Uses for web mining:
  • Target electronic advertisements and coupons at
    user groups.
  • Predict user behaviour, e.g. online purchasing
    patterns, and monitor customers’ online shopping
    behaviour.
  • Assist website designers in discovering design flaws
    and improving website structure.
References

• Chapter 10 of Data Mining Techniques and Applications
