Module VI Link Analysis final.pptx

Link analysis is used to create connections within data sets, represented as networks, such as the World Wide Web, where web pages are nodes and hyperlinks are edges. PageRank, developed by Google's founders, ranks web pages based on the quantity and quality of links, treating links as votes for importance, with a damping factor to account for random page jumps. Challenges such as dead ends and spider traps are addressed through methods like taxation and random teleports to ensure accurate PageRank calculations.

Link Analysis

• Purpose of Link Analysis:

• The purpose of link analysis is to discover connections in a
set of data that can be represented as a network of
information.
• For example, on the Internet, computers and routers
communicate with each other; this can be represented
as a dynamic network whose nodes are the computers
and routers.
• The edges of the network are the physical links between these
machines.
• Another example is the Web, which can be represented as a
graph.
• Representation of the World Wide Web (WWW) as a graph:
the WWW can be represented as a directed graph.
• In this graph, nodes correspond to web pages.
• Every web page is a node, and directed edges between
these web pages correspond to hyperlinks.
• These hyperlink relationships can be used to
create a network.
PAGE RANKING
• PageRank is an algorithm that was designed
for the initial implementation of the Google
search engine.
• PageRank is an approach for computing
the importance of web pages
in a large web graph.
• PageRank – Basic Concept:
• A page (a node in the graph) is as important as the
in-links it has.
• Not all in-links are equal.
• Links coming from important pages are worth
more.
• The importance of a page depends on the importance
of the other pages that point to it.
• Thus, the importance of a page depends on the
importance of the pages pointing to it, and that importance
in turn gets passed on further through the graph.
• DIFFERENT MECHANISMS OF FINDING
PAGERANK
PageRank Definition
PageRank(PR) is the number of
points out of ten that signifies the
importance of a site to Google.
PageRank is a function that assigns a real number to each page
in the Web (or at least to that portion of the Web that has
been crawled and its links discovered). The intent is that the
higher the Page Rank of a page, the more “important” it is.
● The PageRank algorithm, a core part of Google's search engine,
determines the importance of web pages based on the quantity and
quality of links pointing to them, essentially acting like a democratic
voting system where links are votes.

● Links as Votes:
When one page links to another, it's like casting a vote for that page.
● Weighted Votes:
Not all votes are equal. Links from important pages carry more weight than
those from less important pages.
● PageRank Calculation:
The PageRank of a page is calculated by summing the PageRank of all
pages linking to it, divided by the number of outgoing links from those pages.
● Random Surfer Model:
The algorithm simulates a random surfer who clicks on links randomly,
eventually settling on pages with high PageRank.
● Damping Factor:
A damping factor (typically 0.85) is used to account for the fact that users
don't always follow links and might instead choose to randomly jump to
another page.
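For reference, these pieces combine into the commonly quoted update formula (a standard textbook form, not shown explicitly on the slide; some variants omit the 1/N term):

PR(u) = (1 − d)/N + d · Σ over pages v linking to u of PR(v)/L(v)

where d is the damping factor (typically 0.85), N is the total number of pages, and L(v) is the number of outgoing links on page v.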
PageRank was developed by Larry Page and Sergey Brin, the founders of
Google, while they were Ph.D. students at Stanford University in the late
1990s. The idea behind PageRank is seemingly simple: it ranks web pages
based on their importance, as determined by the number and quality of links
to them.

● PageRank treated the burgeoning
internet like a graph, with pages as
nodes and hyperlinks as connections
between them.
● Links were essentially a voting system
- each link to a page was a vote for its
importance - but not all votes were
equal.
● Links from more important pages (i.e.,
those with more of their own links in)
were given more weight than those
from less important pages (i.e., those
with fewer links into them).
Understanding PageRank

Fundamentally, here’s how the PageRank calculation works:

Given a defined set of linked web pages, PageRank calculates the
probability that a person who clicks any single, random link will end up on a
particular page.

This allows the algorithm to assign a weight between 0 and 1 to each page,
representing the probability that it will be the next page visited.

PageRank then uses the weight of that particular page (i.e., how likely it is to
be clicked) to assign an equal share of that weight to each of its outgoing links.

For example, if a page with a calculated PageRank value of .25 links to two
other pages, it would confer half of its PageRank score to each of those
pages
PR(u) = Σ over pages v linking to u of PR(v) / L(v)

where the PageRank PR(u) of any particular page u is the sum of the PageRank
of every page linking to u, each divided by the number of
links that page has (L(v)).
1. Iteration 0: Initialize all ranks to be 1/(number of total
pages).
2. Iteration 1: For each page u, update u's rank to be
the sum of each incoming page v's rank from the
previous iteration, divided by the total number
of links from page v.
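The two steps above can be written as a short loop. Below is a minimal Python sketch; the tiny four-page graph and the fixed number of iterations are assumptions made for illustration, not the example from the slides.

# Simplified PageRank iteration (no teleport/damping yet).
graph = {                     # page -> list of pages it links to (assumed toy data)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

n = len(graph)
rank = {page: 1.0 / n for page in graph}          # Iteration 0: all ranks 1/n

for _ in range(20):                               # Iterations 1..20
    new_rank = {page: 0.0 for page in graph}
    for v, outlinks in graph.items():
        share = rank[v] / len(outlinks)           # v's rank split over its out-links
        for u in outlinks:
            new_rank[u] += share                  # u collects shares from incoming pages
    rank = new_rank

print(rank)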
Example 1
• PageRank helps in measuring the relative
importance of a document within a set of
similar documents.
• A numeric weight assigned to any given
element E is referred to as the PageRank of E
and denoted by PR(E).
• The PageRank of a page indicates its importance: the higher
the value, the more important the web page.
• A hyperlink to the page counts as a vote or
support.
• The PageRank of a page is defined recursively
and depends on the number and PageRank
metric of all the pages that link to
it ("incoming links").
• A page that is linked to by many pages with
high PageRank receives a high rank itself.
Link Structure of the Web
• 150 million web pages 🡪 1.7 billion links

Backlinks and Forward links:
A and B are C's backlinks.
C is A's and B's forward link.

Intuitively, a webpage is important if
it has a lot of backlinks.
Page Rank Concept

A tiny version of the Web.

Page Rank Concept
1. Initially the surfer can be at any of the n
pages with probability 1/n. We denote this initial
distribution by the vector

v0 = (1/n, 1/n, …, 1/n)
Page Rank Concept
2. Build M, the transition matrix.
This matrix has n rows and n columns, if there
are n pages. The element m_ij in row i and column j
has value 1/k if page j has k arcs out and one of them
is to page i; otherwise m_ij = 0.
(k is the number of out-links from page j, one of which points to page i.)

The cell value m_ij indicates the probability that a random surfer moves to page i from page
j.
Page Rank Concept
(The slides show the example transition matrix for pages A, B, C, D here.)

This probability is the (idealized) PageRank function.
Page Rank Concept
2. Build M, the transition matrix (continuation).
This is a Markov chain process, and M is a Markov
transition matrix.
The matrix has two properties:
1. The sum of the entries of any column of the
matrix M is always equal to 1.
2. All entries have values greater than or equal to
zero.
Page Rank Concept

3. If the vector v shows the probability distribution
for the surfer's current location, we can use v and M to
get the distribution vector for the next step:
Mv.

4. After two steps it will be M(Mv0) = M^2 v0,
and so on.
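Steps 1–4 can be sketched with a few lines of NumPy. The three-page graph encoded in M below is an assumption for illustration; the loop simply computes Mv0, M(Mv0), and so on.

import numpy as np

# Column-stochastic transition matrix M for an assumed 3-page graph:
# page 0 links to pages 1 and 2; page 1 links to page 2; page 2 links to page 0.
M = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

n = M.shape[0]
v = np.full(n, 1.0 / n)      # v0: surfer equally likely to start anywhere

for _ in range(50):          # v, Mv, M^2 v0, ... until (near) convergence
    v = M @ v

print(v)                     # approximate idealized PageRank vector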
An example of Simplified PageRank
(The slides show the PageRank calculation on a small example graph: the first
iteration, the second iteration, and convergence after some iterations.)
Structure of Web
Problems Faced by PageRank

• There are really two problems we need to avoid.

• 1. Dead Ends: A page that has no links out. Surfers
reaching such a page disappear, and the result is that, in
the limit, no page that can reach a dead end can have
any PageRank at all.

• 2. Spider Traps: Groups of pages that all have out-links
but never link to any page outside the group. These structures
are called spider traps.
Dead End Example
Solutions
• There are two approaches to dealing with dead ends.

• 1. We can drop the dead ends from the graph, and also drop
their incoming arcs. Doing so may create more dead ends,
which also have to be dropped, recursively. However,
eventually we wind up with a strongly connected
component, none of whose nodes are dead ends. Recursive
deletion of dead ends will remove parts of the
out-component, tendrils, and tubes, but leave the SCC and
the in-component, as well as parts of any small isolated
components. (A sketch of this recursive deletion follows the list below.)

• 2. Use the taxation method.
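Here is the sketch referred to in approach 1: a small Python routine, under the assumption that the graph is given as a dict from each page to its set of out-links, that recursively drops dead ends and their incoming arcs.

def drop_dead_ends(graph):
    """Recursively remove dead ends (pages with no out-links) and their incoming arcs.

    Assumes every page appears as a key in `graph` (page -> collection of out-links).
    """
    g = {u: set(vs) for u, vs in graph.items()}
    while True:
        dead = [u for u, vs in g.items() if not vs]   # current dead ends
        if not dead:
            return g                                  # no dead ends left
        for u in dead:
            del g[u]                                  # drop the dead end itself
        for u in g:
            g[u] -= set(dead)                         # drop arcs into the removed pages

For example, drop_dead_ends({"A": {"B"}, "B": set()}) first removes B, which turns A into a dead end, so A is removed in the next pass as well.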


• Taxation:
• We can modify the process by which random surfers are assumed
to move about the Web.
• This method, which we refer to as 'taxation', also solves the
problem of spider traps.
• Here, we modify the calculation of PageRank by allowing each
random surfer a small probability of teleporting to a random page,
rather than following an out-link from their current page. The
iterative step, where we compute a new vector estimate of PageRank
v' from the current PageRank estimate v and the transition
matrix M, is

v' = βMv + (1 − β)e/n

where β is a chosen constant, usually in the range of 0.8 to 0.9,
e is a vector of all 1's with the appropriate number of components, and
n is the number of nodes in the Web graph.
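A minimal NumPy sketch of the taxation step v' = βMv + (1 − β)e/n; the three-page matrix M and the choice β = 0.85 are assumptions for illustration.

import numpy as np

beta = 0.85
M = np.array([               # same assumed 3-page transition matrix as before
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
n = M.shape[0]
v = np.full(n, 1.0 / n)

for _ in range(50):
    v = beta * (M @ v) + (1 - beta) / n   # follow a link w.p. beta, teleport w.p. 1 - beta

print(v)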
Spider Trap
• A spider trap is a set of nodes with no dead ends
but no arcs out of the set. These structures can appear
intentionally or unintentionally on the Web, and
they cause the PageRank calculation to place all
the PageRank within the spider traps.

• Solution: Teleporting
• The surfer jumps from one page to a random page,
rather than following an out-link from their
current page.
Random Walks in Graphs
• The Random Surfer Model
– The simplified model: the "random surfer" simply
keeps clicking successive links at random; PageRank is the
standing (stationary) probability distribution of this random
walk on the graph of the web.
• The Modified Model
– The modified model: the "random surfer" simply
keeps clicking successive links at random, but
periodically "gets bored" and jumps to a random
page based on the distribution of E.
Modified page rank
An example of Modified PageRank

Dangling Links
• Links that point to any page with no outgoing links
• Most are pages that have not been downloaded
yet
• They affect the model since it is not clear where their
weight should be distributed
• They do not affect the ranking of any other page directly

They can simply be removed before the PageRank calculation
and added back afterwards.
PageRank:
The Google Formulation
Solution: Random Teleports

With probability β the surfer follows a random out-link; with probability 1 − β it
teleports to a random page:

r_j = Σ over i linking to j of [ β · r_i / d_i ] + (1 − β) · 1/N

where d_i is the out-degree of node i and N is the number of pages.

(Slides adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)
The Google Matrix

The Google matrix A combines link-following and teleporting:

A = β·M + (1 − β)·[1/N]_{N×N}

where [1/N]_{N×N} is the N-by-N matrix
in which all entries are 1/N.
Random Teleports (β = 0.8)

Example with three pages y, a, m (from M: y links to y and a, a links to y and m,
and m links only to itself; the graph is drawn in the slides):

        M (follow a link)            [1/N]NxN (teleport)
   y  [ 1/2  1/2   0 ]              [ 1/3  1/3  1/3 ]
   a  [ 1/2   0    0 ]     and      [ 1/3  1/3  1/3 ]
   m  [  0   1/2   1 ]              [ 1/3  1/3  1/3 ]

A = 0.8·M + 0.2·[1/3]:

   y  [ 7/15  7/15   1/15 ]
   a  [ 7/15  1/15   1/15 ]
   m  [ 1/15  7/15  13/15 ]

Power iteration starting from (1/3, 1/3, 1/3):

   y    1/3   0.33   0.24   0.26   ...    7/33
   a    1/3   0.20   0.20   0.18   ...    5/33
   m    1/3   0.46   0.52   0.56   ...   21/33
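The numbers above can be reproduced with a short NumPy sketch (page order y, a, m; the 100-iteration cap is an arbitrary choice):

import numpy as np

beta = 0.8
M = np.array([            # columns: y, a, m (column-stochastic)
    [0.5, 0.5, 0.0],      # y
    [0.5, 0.0, 0.0],      # a
    [0.0, 0.5, 1.0],      # m
])
N = 3
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)   # entries 7/15, 1/15, 13/15

r = np.full(N, 1.0 / 3)
for _ in range(100):
    r = A @ r

print(A)
print(r)                  # approaches (7/33, 5/33, 21/33)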
PageRank Implementation
• Convert each URL into a unique integer and store each
hyperlink in a database using the integer IDs to identify
pages
• Sort the link structure by ID
• Remove all the dangling links from the database
• Make an initial assignment of ranks and start iterating
■ Choosing a good initial assignment can speed up PageRank convergence
• Add the dangling links back afterwards.
PageRank Implementation Using MapReduce
Map:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank / size(outlink_list) )
    emit( key: url, value: outlink_list )
PageRank Implementation Using MapReduce
Reducer:

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0
    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls
    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )
    emit( key: [url, pagerank], value: outlink_list )   // re-emit so the next iteration can run
function Mapper
    input -> PageN with outlinks PageA, PageB, PageC...
    begin
        Nn := the number of outlinks for PageN;
        for each outlink PageK
            output PageK -> PR(PageN) / Nn   // each outlink receives an equal share of PageN's rank
        // also output the outlinks for PageN, so the reducer can rebuild the graph
        output PageN -> PageA, PageB, PageC…
    end

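To see the map and reduce steps working together, here is a small in-memory emulation in Python (a sketch only: the shuffle is simulated with a dict, and the toy page data and DAMPING_FACTOR = 0.85 are assumptions):

from collections import defaultdict

DAMPING_FACTOR = 0.85
pages = {            # url -> (current pagerank, outlink list); assumed toy data
    "A": (0.25, ["B", "C"]),
    "B": (0.25, ["C"]),
    "C": (0.25, ["A"]),
    "D": (0.25, ["C"]),
}

# Map phase: each page emits a rank share to every out-link, plus its own link list.
emitted = defaultdict(list)
for url, (pagerank, outlinks) in pages.items():
    for outlink in outlinks:
        emitted[outlink].append(pagerank / len(outlinks))
    emitted[url].append(outlinks)

# Reduce phase: sum the incoming shares, apply the damping factor, keep the link list.
new_pages = {}
for url, values in emitted.items():
    outlink_list, rank_sum = [], 0.0
    for v in values:
        if isinstance(v, list):
            outlink_list = v
        else:
            rank_sum += v
    pagerank = 1 - DAMPING_FACTOR + DAMPING_FACTOR * rank_sum
    new_pages[url] = (pagerank, outlink_list)

print(new_pages)     # ranks after one map/reduce round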
What is Web Spam?
• Spamming:
– Any deliberate action to boost a web
page’s position in search engine results
• Spam:
– Web pages that are the result of spamming
• Approximately 10-15% of web pages are
spam

Web Search
• Early search engines:
– Crawl the Web
– Index pages by the words they contained
– Respond to search queries (lists of words) with the
pages containing those words
• Early page ranking:
– Attempt to order pages matching a search query by
“importance”
– First search engines considered:
• (1) Number of times query words appeared
• (2) Prominence of word position, e.g. title, header

First Spammers
• As people began to use search engines to find
things on the Web, those with commercial
interests tried to exploit search engines to
bring people to their own site – whether they
wanted to be there or not
• Example:
– Shirt-seller might pretend to be about “movies”
• Techniques for achieving high
relevance/importance for a web page

First Spammers: Term Spam
• How do you make your page appear to be about
movies?
– (1) Add the word movie 1,000 times to your page
– Set text color to the background color, so only search
engines would see it
– (2) Or, run the query “movie” on your
target search engine
– See what page came first in the listings
– Copy it into your page, make it “invisible”
• These and similar techniques are term spam

Google’s Solution to Term Spam
• Believe what people say about you, rather
than what you say about yourself
– Use words in the anchor text (words that appear
underlined to represent the link) and its
surrounding text

• PageRank as a tool to measure the
“importance” of Web pages

Link Spam
The techniques for artificially increasing the
PageRank of a page are collectively called link
spam.
Architecture of a Spam Farm
• A collection of pages whose purpose is to
increase the PageRank of a certain page or
pages is called a spam farm.
Solutions - For Link Spam
1. TrustRank
TrustRank = topic-specific PageRank with a teleport set of trusted pages
• Example: .edu domains, .gov domains

• Basic principle: Approximate isolation
– It is rare for a “good” page to point to a “bad” (spam) page

• Sample a set of seed pages from the web
• Identify the good pages and the spam pages in the seed set
– This is an expensive task, so we must make the seed set as small as possible
• Call the subset of seed pages that are identified as good the trusted pages
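As described above, TrustRank is topic-specific PageRank whose teleport set is the trusted pages. A minimal NumPy sketch (the three-page matrix and the trusted set {page 0} are illustrative assumptions):

import numpy as np

beta = 0.85
M = np.array([                      # assumed 3-page column-stochastic transition matrix
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
n = M.shape[0]
trusted = [0]                       # indices of trusted seed pages (assumed)

teleport = np.zeros(n)
teleport[trusted] = 1.0 / len(trusted)   # teleport only to trusted pages

t = teleport.copy()
for _ in range(50):
    t = beta * (M @ t) + (1 - beta) * teleport   # topic-specific PageRank

print(t)                            # TrustRank scores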
Solutions - For Link Spam
2. Spam Mass
• In the TrustRank model, we start with good pages
and propagate trust

• Complementary view:
What fraction of a page’s PageRank comes from
spam pages?
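A common way to make this fraction concrete (the standard spam-mass formulation from the Mining of Massive Datasets treatment, not spelled out on the slide) is to compare a page's ordinary PageRank r with its TrustRank t:

spam mass = (r − t) / r

Pages whose spam mass is close to 1 receive most of their PageRank from untrusted sources and are likely spam; in Python this is simply spam_mass = (pagerank - trustrank) / pagerank.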
HITS: Hubs and Authorities
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)
– Is a measure of importance of pages or
documents, similar to PageRank
– Proposed at around same time as PageRank (‘98)
• Goal: Say we want to find good newspapers
– Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
• Idea: Links as votes
– Page is more important if it has more links
• In-coming links? Out-going links?
Finding newspapers
• Hubs and Authorities
Each page has 2 scores:
– Quality as an expert (hub):
• Total sum of votes of authorities pointed to
• (Counts out-links)
– Quality as a content provider (authority):
• Total sum of votes coming from experts
• (Counts in-links)
• Principle of repeated improvement

(The slide figure shows example authority scores: NYT: 10, WSJ: 9, CNN: 8, Ebay: 3, Yahoo: 3.)
Hubs and Authorities
Interesting pages fall into two classes:
1. Authorities are pages containing
useful information
– Newspaper home pages
– Course home pages
– Home pages of auto manufacturers

2. Hubs are pages that link to authorities


– List of newspapers
– Course bulletin
– List of US auto manufacturers

Mutually Recursive Definition
• A good hub links to many good authorities
• A good authority is linked from many good hubs
A Simple Example
(The slides work through HITS on a small five-node graph, shown as a figure. Each node
carries two scores: an authority score ("Auth") and a hub score ("Hub"). All scores start at 1.000.)

Update Authority Scores first, using Hub scores:
each node's new authority score is the sum of the hub scores of the nodes pointing to it
(a node with one incoming edge gets 1.000, a node with two incoming edges gets 2.000,
and a node with no incoming edges gets 0.000). For this graph the unnormalized
authority scores are 1.000, 3.000, 1.000, 0.000, 2.000.

Update Hub Scores using the new Authority Scores:
each node's new hub score is the sum of the new authority scores of the nodes it points to.
For this graph the unnormalized hub scores are 3.000, 1.000, 2.000, 6.000, 3.000.

Normalization of Authority:
Sum of squares: 1 + 9 + 1 + 0 + 4 = 15.000
Divide by sqrt(15) = 3.873:  1/3.873 = 0.258,  3/3.873 = 0.775,  2/3.873 = 0.516
Normalized authority scores: 0.258, 0.775, 0.258, 0.000, 0.516

Normalization of Hub:
Sum of squares: 9 + 1 + 4 + 36 + 9 = 59
Divide by sqrt(59) = 7.681
Normalized hub scores: 0.391, 0.130, 0.260, 0.781, 0.391

After First Iteration:
Auth: 0.258, 0.775, 0.258, 0.000, 0.516    Hub: 0.391, 0.130, 0.260, 0.781, 0.391
After Second Iteration:
Auth: 0.383, 0.767, 0.064, 0.000, 0.511    Hub: 0.374, 0.031, 0.249, 0.811, 0.374
After Third Iteration:
Auth: 0.395, 0.760, 0.015, 0.000, 0.517    Hub: 0.370, 0.007, 0.252, 0.814, 0.370
The scores are already close to their converged values.
HITS Algorithm
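The formal HITS algorithm appeared as a figure on this slide. Below is a minimal Python sketch of the repeated-improvement loop used in the example above (update authorities from hubs, update hubs from the new authorities, then normalize both by the square root of the sum of squares); the toy adjacency list and iteration count are assumptions.

import math

links = {                 # hub page -> authorities it points to (assumed toy graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}
pages = sorted(set(links) | {p for vs in links.values() for p in vs})

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Update authority scores first, using the current hub scores.
    auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
    # Then update hub scores using the new authority scores.
    hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
    # Normalize both score vectors by the square root of the sum of squares.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authority:", auth)
print("hub:", hub)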
PageRank and HITS
• PageRank and HITS are two solutions to the
same problem:
– What is the value of an in-link from u to v?
– In the PageRank model, the value of the link
depends on the links into u
– In the HITS model, it depends on the value of the
other links out of u

• The destinies of PageRank and HITS
post-1998 were very different
