By
Saumil Shah
Roll No : 46
MCA 4th sem
WEB MINING
Agenda
World Wide Web – a brief history
Introduction to Data Mining
Data Mining Process & Techniques
Web Mining
Data Mining Vs Web Mining
Classification of Web Mining
Benefits & Application Areas of Web Mining
Web Mining Software
Summary
8/12/10
Data Mining vs. Web Mining
Traditional data mining
data is structured and relational
well-defined tables, columns, rows, keys, and
constraints.
Web data
Semi-structured (HTML documents) and
unstructured (free text)
readily available data
rich in features and patterns
Problems when interacting with the Web
» Finding relevant information
» Creating new knowledge out of the
information available on the Web
» Personalization of the information
» Learning about consumers or individual users
Web Mining
Web Mining - Definition
» “Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from the Web data.”
» The web mining process is similar to the data mining
process; the difference is usually in the data collection.
» In data mining, the data is often already collected and
stored in a data warehouse.
» In web mining, data collection can be a substantial task,
especially for web structure and content mining, which
involves crawling a large number of target web pages.
Web Mining - Subtasks
Resource finding
Retrieving intended documents
Information selection/pre-processing
Select and pre-process specific information from selected
documents
Generalization
Discover general patterns at individual web sites as well as
across multiple web sites
Analysis
Validation and/or interpretation of mined patterns
Web Mining Contd..
Web Mining is not IR:
Information retrieval (IR) is the automatic retrieval of all
relevant documents while at the same time retrieving as few
of the non-relevant documents as possible
Web Mining is not IE:
Information extraction (IE) aims to extract the relevant facts
from given documents
IE systems for the general Web are not feasible
Most focus on specific Web sites or content
Web Usage Mining
Web Usage Mining refers to the discovery of user access
patterns from the web usage logs, which record every click
made by each user.
The usage data records the user’s behavior when the user
browses or makes transactions on the web site in order to better
understand and serve the needs of users or Web-based
applications.
It is an activity that involves the automatic discovery of
patterns from one or more Web servers.
Web Usage Mining Contd..
Organizations often generate and collect large volumes of data;
most of this information is generated automatically by
Web servers and collected in server logs.
Analyzing such data can help these organizations to
determine:
the value of particular customers
cross marketing strategies across products
the effectiveness of promotional campaigns, etc.
Typical Sources of Data
automatically generated data stored in server access logs,
proxy server logs, referrer logs, browser logs, bookmark
data, mouse clicks and scrolls, and client-side cookies
user profiles
meta data: page attributes, content attributes, usage data
Web Usage Mining Contd..
The first web analysis tools simply provided mechanisms to
report user activity as recorded in the servers. Using such tools,
it was possible to determine such information as:
the number of accesses to the server
the times or time intervals of visits
the domain names and the URLs of users of the Web server.
Two main categories:
Learning a user profile (personalized)
Web users would be interested in techniques that learn
their needs and preferences automatically
Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques
that improve the effectiveness of their Web site or steer
the users towards the goals of the site
Web Usage Mining Contd..
Web servers, Web proxies, and client applications can quite
easily capture Web Usage data.
Web server log:
Records every visit to the pages: which files were requested
and when, the IP address of the request, the error code, the
number of bytes sent to the user, and the type of browser used…
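As a minimal sketch of how such a log entry might be read, the example below parses one line written in the widely used Combined Log Format. The sample line, the regular expression, and the `parse_log_line` helper are illustrative assumptions and are not taken from any particular server's tooling.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means that no response body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Hypothetical log entry, made up for illustration.
sample = (
    '192.0.2.10 - - [12/Aug/2010:10:15:32 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 5120 '
    '"http://example.com/home.html" "Mozilla/5.0"'
)

record = parse_log_line(sample)
print(record["ip"], record["time"], record["path"], record["status"], record["bytes"])
```

Each parsed record carries the fields listed above (requested file, time, IP address, status code, bytes sent, browser), so a list of such records is a natural input for the usage mining steps that follow.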
By analyzing the Web usage data, web mining systems can
discover useful knowledge about a system’s usage
characteristics and the users’ interests which has various
applications:
Personalization and Collaboration in Web-based systems
Marketing
Web site design and evaluation
Decision support
Web Usage Mining Contd..
Web Log Mining is the technique of retrieving visitor
information from web server log files and analyzing it.
The major types of log files are
Access Log: maintains a list of all the web pages that
the visitors have requested.
Agent Log: contains information about the browser
that was used to explore the various web pages.
Web Content Mining
Web Content Mining extracts or mines useful information or
knowledge from web page contents.
In this mining, patterns are extracted from online sources
such as
HTML files
Text documents
Images
E-books or email messages
Audio or Video
The concept of WCM is far wider than searching for any specific
term or only keyword extraction or some simple statistics of words
and phrases in documents.
A tool that performs WCM can summarize a web page so that you
need not read the complete document, saving you time and energy.
Web Content Mining Contd..
The two basic approaches or models to implement WCM are
Local Knowledge Base Model:
The abstract characterizations of several web pages
are stored locally (i.e. references to several web sites relating
to the categories are stored in a database and, based on the
selected category, the search is performed within the
web site).
Agent Based Model:
This approach applies Artificial Intelligence
systems known as Web Agents that can perform a search on
behalf of a particular user for discovering and organizing
documents in the web. Some web agents can apply individual
user profiles for searching information from the web and
organize and interpret the discovered information.
Preprocessing Content
Content Preparation:
Extract text from HTML.
Perform Stemming.
Remove Stop Words.
Calculate Collection Wide Word Frequencies (DF).
Calculate per Document Term Frequencies (TF).
Vector Creation:
Common Information Retrieval Technique.
Each document (HTML page) is represented by a sparse
vector of term weights.
Typically, additional weight is given to terms appearing as
keywords or in titles.
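The preparation and vector creation steps above can be sketched as follows. This is a rough illustration only: the stop-word list is tiny, the `stem` function is a crude suffix stripper standing in for a real stemmer, the TF-IDF weighting `tf * log(N / df)` is one common choice among several, and the extra weight for title terms is left out. All function names and the sample pages are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in", "for", "on"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def stem(word):
    # Crude suffix stripping, an assumed stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    words = re.findall(r"[a-z]+", html_to_text(html).lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(pages):
    """Return one sparse vector (dict of term -> weight) per page."""
    term_counts = [Counter(tokenize(page)) for page in pages]  # TF per document
    doc_freq = Counter()                                       # DF over the collection
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(pages)
    return [
        {term: tf * math.log(n / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


pages = [
    "<html><title>Web mining</title><body>Mining the web logs</body></html>",
    "<html><body>Data mining of tables</body></html>",
]
vectors = tfidf_vectors(pages)
for vector in vectors:
    print(sorted(vector.items()))
```

A term that occurs in every page gets weight zero under this weighting, which reflects the idea that collection-wide frequent words carry little information about any single document.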
Common Mining Techniques
The more basic and popular data mining techniques include:
Classification- Classification on server logs using decision trees or
Naive Bayes classifiers to discover the profiles of users
belonging to a particular category.
Clustering- can be used to group users exhibiting similar
browsing patterns.
Associations- can be used to relate pages that are most often
referenced together in a single server session.
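The association idea above can be illustrated with a small sketch that counts how often two pages appear in the same session and keeps the pairs whose support reaches a threshold. The session data, the `page_associations` name, and the 0.5 threshold are assumptions made for this example; a full association rule miner would also compute confidence and handle larger item sets.

```python
from collections import Counter
from itertools import combinations


def page_associations(sessions, min_support=0.5):
    """Return {(page_a, page_b): support} for page pairs seen together often."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical server sessions: each list holds the pages one visitor requested.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]

associations = page_associations(sessions)
for pair, support in sorted(associations.items()):
    print(pair, support)
```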
The other significant ideas are:
Topic Identification, tracking and drift analysis
Concept hierarchy creation
Relevance of content.
Web Structure Mining
Web Structure Mining discovers useful knowledge from
hyperlinks, which represent the structure of the web.
Web structure mining can be divided into two kinds:
Extract patterns from hyperlinks in the web. A hyperlink is
a structural component that connects the web page to a
different location.
Mining the document structure. It uses the tree-like
structure to analyze and describe the HTML or XML tags
within the web page.
Web structure mining is the process of using graph theory to
analyze the node and connection structure of a web site.
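Both kinds of structure mining start from the same raw material: the tags of a page. As a rough sketch, the parser below collects the hyperlinks found in a page and records the nesting depth of every tag, which gives a simple view of the tree-like document structure. The `StructureParser` class and the sample page are hypothetical, and the depth tracking assumes well-formed HTML with properly closed tags.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Record the out-links of a page and the nesting depth of each tag."""

    # Tags that never have a closing tag, so they do not open a new level.
    VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.links = []    # href values of <a> tags, in document order
        self.outline = []  # (depth, tag) pairs describing the tag tree
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.depth > 0:
            self.depth -= 1


# Hypothetical page, made up for illustration.
page = (
    '<html><body><h1>Title</h1>'
    '<p>See <a href="/a.html">A</a> and '
    '<a href="http://example.com/b">B</a>.</p>'
    '</body></html>'
)

parser = StructureParser()
parser.feed(page)
parser.close()

print(parser.links)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Running such a parser over many pages yields the edges of the link graph (page to out-links), which is the input that graph-based analysis works on.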
Web Structure Mining Contd..
Web Structure is a useful source for extracting information
such as
Web Page Classification
Classifying web pages according to various topics
Quality of Web Page
The authority of a page on a topic
Ranking of web pages
Which pages to crawl
Deciding which web pages to add to the collection of web
pages
Finding Related Pages
Given one relevant page, find all related pages
Web Structure Mining Contd..
Hyperlink-Induced Topic Search (HITS) is a common
algorithm for knowledge discovery in the Web. The
concept of HITS is the identification of authorities and hubs.
Web Structure Mining
Identification of
Authorities: authoritative, high-quality web pages on broad
topics
Hubs: web pages that link to a collection of authorities
A good authority is pointed to by many good hubs
A good hub points to many good authorities
Web structure mining has been largely influenced by research
in
Social network analysis
Citation analysis (bibliometrics).
in-links: the hyperlinks pointing to a page
out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the more authoritative a page is considered to be.
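The mutual reinforcement between hubs and authorities can be sketched with a short iterative computation. This is a simplified illustration of the HITS idea on a small hand-made link graph: the `hits` function name, the fixed iteration count, and the example graph are assumptions, and the query-dependent step that selects the pages to analyze is left out.

```python
import math


def hits(graph, iterations=50):
    """Compute hub and authority scores.

    graph maps each page to the list of pages it links to (its out-links).
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # A good hub points to many good authorities.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: three hub-like pages linking to two authorities.
graph = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
}

hub, auth = hits(graph)
print(sorted(auth.items(), key=lambda item: -item[1]))
print(sorted(hub.items(), key=lambda item: -item[1]))
```

In this graph `a1` has three in-links and `a2` has two, so `a1` ends up with the higher authority score, while `h1` and `h2` outrank `h3` as hubs because they point to both authorities.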
Application Areas of Web Mining
E-commerce
Search Engines
Personalization
Website Design
Web mining applications
Amazon.com
Google
DoubleClick
AOL
eBay
MyYahoo
CiteSeer
I-MODE
v-TAG Web Mining Server
Applications Contd..
Amazon:
A host of Web mining techniques, e.g. associations between
pages visited, click-path analysis, etc., are used to improve the
customer’s experience during a ’store visit’. Knowledge gained
from Web mining is the key intelligence behind Amazon’s
features such as ’instant recommendations’, ’purchase circles’,
’wish-lists’, etc.
Applications Contd..
Google
Earlier search engines concentrated on the Web content to
return the relevant pages to a query. Google was among the first to
exploit the link structure when mining
information from the web. PageRank, which measures the
importance of a page, is the underlying technology in all
Google search products.
PageRank makes use of the structural
information of the Web graph and is the key to returning quality
results relevant to a query.
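To make the idea of ranking by link structure concrete, here is a minimal sketch of the textbook PageRank iteration on a tiny graph. It is not Google's production system: the damping factor of 0.85, the iteration count, the handling of pages without out-links, and the example graph are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Return a rank for every page.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page `c` collects links from all other pages and therefore ranks highest, while `d`, which nothing links to, keeps only the baseline share.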
Benefits of Web Mining
Match your available resources to visitor interests
Increase the value of each visitor
Improve the visitor's experience at the website
Perform targeted resource management
Collect information in new ways
Test the relevance of content and web site architecture
Web Mining Software
Web Miner
Sinope Summarizer
Teleport Pro
ClickTracks
Summary
Major Limitations of Web Mining research:
Difficult to collect Web Usage data across different Web
Sites.
Lack of suitable test collections that can be reused by
researchers
Future research directions:
Multimedia data mining: A picture is worth a thousand
words.
Multilingual knowledge extraction: Web page translations
The Hidden Web: Forms, Dynamically generated web pages.
Semantic Web
Wireless Web: WML and HDML.