Web Mining: by Saumil Shah Roll No: 46 Mca 4 Sem

This document discusses web mining and its various types. It begins with introducing data mining and comparing it with web mining. It then classifies web mining into three main types - web usage mining, web content mining, and web structure mining. For each type, it provides details on definition, processes involved, sources of data, applications and techniques used.


By

Saumil Shah
Roll No : 46
MCA 4th sem

WEB MINING
Agenda
World Wide Web – a brief history
Introduction to Data Mining
Data Mining Process & Techniques
Web Mining
Data Mining Vs Web Mining
Classification of Web Mining
Benefits & Application Areas of Web Mining
Web Mining Software
Summary
8/12/10
Data Mining vs. Web Mining
Traditional data mining
» data is structured and relational
» well-defined tables, columns, rows, keys, and constraints.
Web data
» semi-structured (HTML documents) and unstructured (free text)
» readily available data
» rich in features and patterns
Problems when interacting with the Web

» Finding relevant information
» Creating new knowledge out of the information available on the Web
» Personalization of the information
» Learning about consumers or individual users
Web Mining

Web Mining - Definition
» “Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from the Web data.”

» The web mining process is similar to the data mining process; the difference is usually in the data collection.
» In data mining, the data is often already collected and stored in a data warehouse.
» In web mining, data collection can be a substantial task, especially for web structure and content mining, which involve crawling a large number of target web pages.
Web Mining - Subtasks
» Resource finding
  » Retrieving intended documents
» Information selection/pre-processing
  » Select and pre-process specific information from selected documents
» Generalization
  » Discover general patterns at individual web sites as well as across multiple web sites
» Analysis
  » Validation and/or interpretation of mined patterns
Web Mining Contd..
Web Mining is not IR:
» Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
Web Mining is not IE:
» Information extraction (IE) aims to extract the relevant facts from given documents
» IE systems for the general Web are not feasible
» Most focus on specific Web sites or content
Web Usage Mining

Web Usage Mining refers to the discovery of user access patterns from the web usage logs, which record every click made by each user.
The usage data records the user’s behavior when the user browses or makes transactions on the web site in order to better understand and serve the needs of users or Web-based applications.
It is an activity that involves the automatic discovery of patterns from one or more Web servers.
Web Usage Mining Contd..
Organizations often generate and collect large volumes of data; most of this information is usually generated automatically by Web servers and collected in server logs.
Analyzing such data can help these organizations to determine:
» the value of particular customers
» cross marketing strategies across products
» the effectiveness of promotional campaigns, etc.
Typical Sources of Data
» automatically generated data stored in server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
» user profiles
» meta data: page attributes, content attributes, usage data
Web Usage Mining Contd..
» The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
  » the number of accesses to the server
  » the times or time intervals of visits
  » the domain names and the URLs of users of the Web server.
» Two main categories:
  » Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically
  » Learning user navigation patterns (impersonalized): Information providers would be interested in techniques that improve the effectiveness of their Web site or bias users towards the goals of the site
Web Usage Mining Contd..
» Web servers, Web proxies, and client applications can quite easily capture Web usage data.
  » Web server log: records every visit to the pages, which files were requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
» By analyzing Web usage data, web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests, which has various applications:
  » Personalization and Collaboration in Web-based systems
  » Marketing
  » Web site design and evaluation
  » Decision support
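The server-log fields listed above (IP address, time, requested file, status code, bytes sent, browser type) can be pulled out of a raw log line roughly as follows. This is a minimal sketch, assuming the server writes the widely used NCSA Combined Log Format; the names LOG_PATTERN and parse_log_line and the sample lines are illustrative and not from the slides.

```python
import re
from collections import Counter

# Assumed layout: NCSA Combined Log Format, one request per line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

# Made-up sample lines, for illustration only.
sample_lines = [
    '192.0.2.1 - - [12/Aug/2010:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [12/Aug/2010:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2310 "/index.html" "Mozilla/5.0"',
    '192.0.2.7 - - [12/Aug/2010:10:01:00 +0000] "GET /missing.html HTTP/1.1" 404 - "-" "Opera/9.80"',
]

records = [r for r in map(parse_log_line, sample_lines) if r is not None]
hits_per_ip = Counter(r["ip"] for r in records)  # number of accesses per client
```

Once the lines are structured records, simple reports such as accesses per client or per page reduce to counting over a field.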
Web Usage Mining Contd..
Web Log Mining is the technique of retrieving visitor information from web server log files and applying it to analyze usage.
The major types of log files are
» Access Log: maintains a list of all the web pages that visitors have requested.
» Agent Log: contains information about the browser that was used to explore the various web pages.
Web Content Mining
Web Content Mining extracts or mines useful information or knowledge from web page contents.
In this mining, patterns are extracted from online sources such as
» HTML files
» Text documents
» Images
» E-books or email messages
» Audio or Video
The concept of WCM is far wider than searching for any specific term or only keyword extraction or some simple statistics of words and phrases in documents.
A tool that performs WCM can summarize a web page so that you need not read the complete document and save your time and energy.
Web Content Mining Contd..
The two basic approaches or models to implement WCM are
» Local Knowledge Base Model: The abstract characterizations of several web pages are stored locally (i.e., references to several web sites relating to the categories are stored in a database, and based on the selected category the search is performed within the web site).
» Agent Based Model: This approach applies Artificial Intelligence systems known as Web Agents, which can perform a search on behalf of a particular user to discover and organize documents in the web. Some web agents can apply individual user profiles when searching for information on the web, and can organize and interpret the discovered information.
Preprocessing Content
Content Preparation:
Extract text from HTML.
Perform Stemming.
Remove Stop Words.
Calculate Collection Wide Word Frequencies (DF).
Calculate per Document Term Frequencies (TF).
Vector Creation:
Common Information Retrieval Technique.
Each document (HTML page) is represented by a sparse vector of term weights.
Typically, additional weight is given to terms appearing as keywords or in titles.
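The preparation steps and the vector creation above can be sketched end to end. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up one, crude_stem is a placeholder rather than a real stemming algorithm, and combining the two frequencies as TF × log(N/DF) is one common weighting that the slides do not spell out. The extra weight for title terms is not implemented here.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in", "from"}  # illustrative only

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def crude_stem(word):
    """Placeholder stemmer: strips a few common suffixes (not a real algorithm)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(html):
    """Extract text, lowercase it, drop stop words, and stem what remains."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def tfidf_vectors(pages):
    """Return one sparse {term: weight} dict per page."""
    term_counts = [Counter(tokenize(p)) for p in pages]              # TF per document
    doc_freq = Counter(t for counts in term_counts for t in counts)  # DF over the collection
    n = len(pages)
    return [
        {t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
        for counts in term_counts
    ]

# Made-up pages, for illustration only.
pages = [
    "<html><title>Web mining</title><body>Mining the web logs</body></html>",
    "<html><title>Data mining</title><body>Mining tables of data</body></html>",
]
vectors = tfidf_vectors(pages)
```

With this weighting, a term that occurs in every page of the collection receives weight zero, so only terms that distinguish one page from another contribute to its vector.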

Common Mining Techniques
The more basic and popular data mining techniques include:
» Classification: decision trees or a Naive Bayes classifier applied to server logs can discover the profiles of users belonging to a particular category.
» Clustering: can be used to group users exhibiting similar browsing patterns.
» Associations: can be used to relate pages that are most often referenced together in a single server session.
The other significant ideas are:
» Topic identification, tracking and drift analysis
» Concept hierarchy creation
» Relevance of content.
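As an illustration of the association technique, the sketch below counts which pairs of pages are requested together in the same session and keeps the pairs that reach a minimum support. It is a simplified pair count rather than a full association-rule miner; the function name, the sessions, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return unordered page pairs that co-occur in at least min_support sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # sorted(set(...)) counts each pair once per session, in a stable order.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical sessions: the pages one visitor requested during one visit.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart", "/home"],
]
pairs = frequent_page_pairs(sessions, min_support=2)
```

Pairs that pass the threshold are candidates for cross-linking or joint promotion; the right threshold depends on the traffic volume of the site.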

Web Structure Mining
Web Structure Mining discovers useful knowledge from hyperlinks, which represent the structure of the web.
» Web structure mining can be divided into two kinds:
  » Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
  » Mining the document structure, which uses the tree-like structure to analyze and describe the HTML or XML tags within the web page.
» It is the process of using graph theory to analyze the node and connection structure of a web site.
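Both kinds of structure can be read from a page in a single pass. The sketch below is a minimal example using Python's standard html.parser: it collects the out-links of a page (hyperlink structure) and counts its tags (a very rough view of document structure). The class name, the page, and the URLs are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

class StructureParser(HTMLParser):
    """Record out-links (hyperlink structure) and tag counts (document structure)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.out_links = []
        self.tag_counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.out_links.append(urljoin(self.base_url, href))

# Hypothetical page and URL, for illustration only.
page = (
    '<html><body><h1>Home</h1>'
    '<a href="/about.html">About</a> '
    '<a href="http://example.org/">External</a>'
    '</body></html>'
)
parser = StructureParser("http://example.com/index.html")
parser.feed(page)
```

Running this over every crawled page and storing each page's out_links gives the node-and-edge graph that the graph-theoretic analysis works on.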
Web Structure Mining Contd..
Web Structure is a useful source for extracting information such as
» Web Page Classification
  » Classifying web pages according to various topics
» Quality of Web Page
  » The authority of a page on a topic
  » Ranking of web pages
» Which pages to crawl
  » Deciding which web pages to add to the collection of web pages
» Finding Related Pages
  » Given one relevant page, find all related pages
Web Structure Mining Contd..
The Hyperlink-Induced Topic Search (HITS) is a common algorithm for knowledge discovery in the Web. The concept of HITS is the identification of
» Authorities: authoritative, high-quality web pages on broad topics
» Hubs: web pages that link to a collection of authorities
A good authority is pointed to by many good hubs
A good hub points to many good authorities
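The mutual reinforcement between hubs and authorities can be computed iteratively. The following is a minimal sketch of that iteration on a small hand-made link graph; the function name, the fixed iteration count, and the example graph are assumptions, and a real system would first select a query-specific subgraph of pages.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a {page: [linked pages]} graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages linking to two authority-like pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
```

In this example a1 ends with the highest authority score because all three hubs point to it, and h1 and h2 outrank h3 as hubs because they point to both authorities.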

Web structure mining has been largely influenced by research in
» Social network analysis
» Citation analysis (bibliometrics).
  » in-links: the hyperlinks pointing to a page
  » out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the better a page is.
Application Areas of Web Mining
E-commerce
Search Engines
Personalization
Website Design
Web mining applications
Amazon.com
Google
DoubleClick
AOL
eBay
MyYahoo
CiteSeer
I-MODE
v-TAG Web Mining Server

Applications Contd..
Amazon:
A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer’s experience during a ‘store visit’. Knowledge gained from Web mining is the key intelligence behind Amazon’s features such as ‘instant recommendations’, ‘purchase circles’, ‘wish-lists’, etc.
Applications Contd..
Google
» Earlier search engines concentrated on Web content to return the pages relevant to a query. Google was among the first to introduce the importance of the link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products.
» The PageRank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.
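The idea of scoring a page from the link structure can be illustrated with the textbook PageRank iteration. This is a sketch of the published algorithm on a toy graph, not a description of Google's production system; the damping factor of 0.85 is the commonly cited value and is an assumption here, as are the function name and the example graph.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return a {page: score} dict for a {page: [linked pages]} graph."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share of the total rank.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for page in nodes:
            targets = graph.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: A, B and D link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Here C scores highest because three pages link to it, and A comes second because it receives C's only out-link, which shows that a link from an important page counts for more than a link from an unimportant one.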

Benefits of Web Mining
Match your available resources to visitor interests

Increase the value of each visitor

Improve the visitor's experience at the website

Perform targeted resource management

Collect information in new ways

Test the relevance of content and web site architecture

Web Mining Software
Web Miner:

Sinope Summarizer:

Teleport Pro:

Click Tracks

Summary
Major Limitations of Web Mining research:
» Difficult to collect Web usage data across different Web sites.
» Lack of suitable test collections that can be reused by researchers
Future research directions:
» Multimedia data mining: A picture is worth a thousand words.
» Multilingual knowledge extraction: Web page translations
» The Hidden Web: forms, dynamically generated web pages.
» Semantic Web
» Wireless Web: WML and HDML.
