Module VI Link Analysis final.pptx
● Links as Votes:
When one page links to another, it's like casting a vote for that page.
● Weighted Votes:
Not all votes are equal. Links from important pages carry more weight than
those from less important pages.
● PageRank Calculation:
The PageRank of a page is calculated by summing, over all pages linking to
it, each linking page's PageRank divided by that page's number of outgoing links.
● Random Surfer Model:
The algorithm models a random surfer who clicks on links at random; in the
long run, the surfer spends the most time on pages with high PageRank.
● Damping Factor:
A damping factor (typically 0.85) is used to account for the fact that users
don't always follow links and might instead choose to randomly jump to
another page.
PageRank was developed by Larry Page and Sergey Brin, the founders of
Google, while they were Ph.D. students at Stanford University in the late
1990s. The idea behind PageRank is seemingly simple: it ranks web pages
based on their importance, as determined by the number and quality of links
to them.
This allows the algorithm to assign each page a probability between 0 and 1
that it will be the next page visited.
PageRank then uses the weight of that particular page (i.e., how likely it is to
be clicked) to assign an equal weight to each of its outgoing links.
For example, if a page with a calculated PageRank value of 0.25 links to two
other pages, it would confer half of its PageRank score (0.125) to each of those
pages.
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
where the PageRank PR(u) of any particular page u is the sum, over every
page v in the set B_u of pages linking to u, of PR(v) divided by the number
of links L(v) that v has.
1. Iteration 0: Initialize all ranks to be 1/(number of total
pages).
2. Iteration 1: For each page u, update u’s rank to be
the sum, over each incoming page v, of v’s rank from the
previous iteration divided by the total number
of links from page v.
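The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `simple_pagerank`, the three-page graph, and the choice to let pages without out-links pass nothing on are all assumptions made for the example.

```python
def simple_pagerank(out_links, iterations=20):
    """Undamped PageRank on a dict {page: [pages it links to]}."""
    pages = list(out_links)
    n = len(pages)
    # Iteration 0: every page starts with rank 1/n.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in pages}
        for v, targets in out_links.items():
            if not targets:
                continue  # assumption: a page with no out-links passes nothing on
            # Page v splits its previous rank evenly over its out-links.
            share = ranks[v] / len(targets)
            for u in targets:
                new_ranks[u] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_pagerank(graph, iterations=50)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 3))
```

Each pass uses only the ranks from the previous iteration, which is why a fresh `new_ranks` dictionary is built before the old one is replaced.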
Example 1
• PageRank measures the relative
importance of a document within a set of
similar documents.
• The numeric weight assigned to any given
element E is referred to as the PageRank of E
and denoted by PR(E).
• The PageRank of a page indicates its importance: the higher
the value, the more important the web page.
• A hyperlink to a page counts as a vote of
support.
• The PageRank of a page is defined recursively
and depends on the number and PageRank
of all the pages that link to
it (“incoming links”).
• A page that is linked to by many pages with
high PageRank receives a high rank itself.
Link Structure of the Web
• 150 million web pages 🡪 1.7 billion links
$v_0 = (1/n, \; 1/n, \; \dots, \; 1/n)^T$
Page Rank Concept
2. Build M, the transition matrix.
This matrix has n rows and n columns if there
are n pages. The element m_ij in row i and column j
has value 1/k if page j has k arcs out and one of them
is to page i; otherwise m_ij = 0.
The value of m_ij is the probability that a random surfer at page j moves next
to page i.
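As a sketch of how the transition matrix might be built and applied once, the Python below uses plain lists. The four-page graph and the helper names `build_transition_matrix` and `step` are assumptions made for the example, and each page is assumed to list every target at most once.

```python
def build_transition_matrix(pages, out_links):
    """Return M with M[i][j] = 1/k if page j has k out-links and one goes to page i."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    M = [[0.0] * n for _ in range(n)]
    for source, targets in out_links.items():
        j = index[source]
        k = len(targets)
        for target in targets:
            M[index[target]][j] = 1.0 / k
    return M


def step(M, v):
    """One update of the rank vector: v_next = M v."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]


# Hypothetical four-page web.
pages = ["A", "B", "C", "D"]
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

M = build_transition_matrix(pages, out_links)
v0 = [1.0 / len(pages)] * len(pages)  # initial vector: 1/n in every entry
v1 = step(M, v0)
print(v1)
```

Because every page in this graph has at least one out-link, each column of M sums to 1, so the entries of the rank vector keep summing to 1 after every step.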
Page Rank Concept
[Figure: example with pages A, B, C, and D]
• 1. We can drop the dead ends from the graph, and also drop
their incoming arcs. Doing so may create more dead ends,
which also have to be dropped, recursively. Eventually,
however, we wind up with a strongly connected
component, none of whose nodes are dead ends.
Recursive deletion of dead ends will remove parts of the
out-component, tendrils, and tubes, but leave the SCC and
the in-component, as well as parts of any small isolated
components.
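The recursive deletion described above can be sketched as a loop that repeats until no dead ends remain. The function name `remove_dead_ends` and the five-page graph are assumptions for illustration, and every page is assumed to appear as a key in the dictionary.

```python
def remove_dead_ends(out_links):
    """Repeatedly drop pages with no out-links, together with arcs pointing at them."""
    graph = {page: set(targets) for page, targets in out_links.items()}
    removed = []
    while True:
        dead = [page for page, targets in graph.items() if not targets]
        if not dead:
            break
        for page in dead:
            del graph[page]
            removed.append(page)
        # Dropping incoming arcs may turn more pages into dead ends.
        for targets in graph.values():
            targets.difference_update(dead)
    return graph, removed


# Hypothetical web: E is a dead end; removing it exposes D, then C.
web = {"A": ["B", "C"], "B": ["A", "C"], "C": ["D"], "D": ["E"], "E": []}
pruned, removed = remove_dead_ends(web)
print(pruned)
print(removed)
```

The order of `removed` matters if ranks for the dropped pages are to be filled in afterwards, since that is usually done in the reverse of the deletion order.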
• Solution : Teleporting
• Surfer jumps from one page to a random page,
rather than following an out-link from their
current page.
Random Walks in Graphs
• The Random Surfer Model
– The simplified model: PageRank is the standing probability
distribution of a random walk on the graph of the
web; the surfer simply keeps clicking successive links at
random
• The Modified Model
– The modified model: the “random surfer” simply
keeps clicking successive links at random, but
periodically “gets bored” and jumps to a random
page based on the distribution of E
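The modified model can be approximated by simulating one surfer and counting visits. This is only a sketch under stated assumptions: the distribution E is taken to be uniform over all pages, the probability of following a link is set to 0.85, a surfer on a page with no out-links always jumps, and the function name `simulate_surfer` and the graph are hypothetical.

```python
import random


def simulate_surfer(out_links, steps=100_000, follow_prob=0.85, seed=0):
    """Estimate visit frequencies of a random surfer who sometimes gets bored."""
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if targets and rng.random() < follow_prob:
            current = rng.choice(targets)  # follow a random out-link
        else:
            current = rng.choice(pages)    # bored: jump (E assumed uniform)
    return {page: count / steps for page, count in visits.items()}


# Hypothetical graph in which nothing links to D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = simulate_surfer(web)
for page, share in sorted(freq.items()):
    print(page, round(share, 3))
```

Page D is reached only by jumps, so its visit share stays small, while pages with many incoming links are visited far more often.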
Modified PageRank
An example of Modified PageRank
Dangling Links
• Links that point to any page with no outgoing links
• Most are pages that have not been downloaded
yet
• Affect the model since it is not clear where their
weight should be distributed
• Do not affect the ranking of any other page directly
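$$r_j = \sum_{i \to j} \beta \, \frac{r_i}{d_i} + (1 - \beta) \, \frac{1}{N}$$
d_i … out-degree of node i; β … damping factor
$$A = \beta M + (1 - \beta) \left[ \frac{1}{N} \right]_{N \times N}$$
[1/N]_{N×N} … N by N matrix where all entries are 1/N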
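A damped PageRank update with teleporting can be sketched as follows. Under the assumptions of this example, each page passes on a fraction `beta` of its rank along its out-links, every page receives (1 - beta)/N from teleporting, and the rank held by pages without out-links is spread uniformly over all pages. That last choice is only one of several ways to handle dangling links, and the function name `pagerank` and both graphs are hypothetical.

```python
def pagerank(out_links, beta=0.85, tol=1e-10, max_iter=200):
    """Damped PageRank on a dict {page: [pages it links to]}."""
    pages = list(out_links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Assumption: rank sitting on pages with no out-links is shared by everyone.
        dangling = sum(ranks[p] for p in pages if not out_links[p])
        base = (1.0 - beta) / n + beta * dangling / n
        new_ranks = {page: base for page in pages}
        for i, targets in out_links.items():
            if targets:
                share = beta * ranks[i] / len(targets)  # beta * r_i / d_i
                for j in targets:
                    new_ranks[j] += share
        change = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Hypothetical graphs: the second one contains a page with no out-links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
ranks_dangling = pagerank({"A": ["B"], "B": []})
print(ranks)
print(ranks_dangling)
```

With this treatment of dangling pages the ranks continue to sum to 1 after every iteration, so they can still be read as a probability distribution.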
• Identify the good pages and the spam pages in the seed set
– Expensive task, so we must make seed set as small as possible
• Call the subset of seed pages that are identified as good the trusted pages
Solutions - For Link Spam
1. Spam Mass
• In the TrustRank model, we start with good pages
and propagate trust
• Complementary view:
What fraction of a page’s PageRank comes from
spam pages?
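One way to estimate that fraction, sketched here under stated assumptions, is to compute ordinary PageRank r and a trust-biased rank t that teleports only to trusted pages, and then take (r - t) / r as the spam mass. The function name `pagerank_with_teleport`, the five-page graph, and the choice of trusted pages are all hypothetical.

```python
def pagerank_with_teleport(out_links, teleport_pages, beta=0.85, iterations=100):
    """PageRank in which the surfer may only teleport to `teleport_pages`."""
    pages = list(out_links)
    jump = 1.0 / len(teleport_pages)
    ranks = {p: (jump if p in teleport_pages else 0.0) for p in pages}
    for _ in range(iterations):
        new_ranks = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for j in targets:
                new_ranks[j] += beta * ranks[i] / len(targets)
        # Rank not passed along links re-enters through the teleport set.
        leaked = 1.0 - sum(new_ranks.values())
        for p in teleport_pages:
            new_ranks[p] += leaked * jump
        ranks = new_ranks
    return ranks


# Hypothetical web: two trusted pages and a small link farm boosting "spam_target".
web = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam_target": ["farm1", "farm2"],
    "farm1": ["spam_target"],
    "farm2": ["spam_target"],
}
trusted = {"good1", "good2"}

r = pagerank_with_teleport(web, set(web))  # ordinary PageRank
t = pagerank_with_teleport(web, trusted)   # trust propagated from trusted pages
spam_mass = {p: (r[p] - t[p]) / r[p] for p in web}
for page, mass in sorted(spam_mass.items()):
    print(page, round(mass, 3))
```

A spam mass close to 1 suggests that almost none of a page's rank is backed by trusted pages, while a negative or small positive value suggests the page is probably not spam.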
HITS: Hubs and Authorities
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)
– Is a measure of importance of pages or
documents, similar to PageRank
– Proposed at around same time as PageRank (‘98)
• Goal: Say we want to find good newspapers
– Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
• Idea: Links as votes
– Page is more important if it has more links
• In-coming links? Out-going links?
J. Leskovec, A. Rajaraman, J. Ullman: Mining
of Massive Datasets, http://www.mmds.org
Finding newspapers
• Hubs and Authorities
Each page has 2 scores: a hub score and an authority score.
[Figure labels: NYT: 10, Ebay: 3]
A Simple Example
Update Authority Scores first
1/3.875 = 0.258
3/3.875 =0.775
A Simple Example
( normalization of Hub -1)
Sum of Squares: 59
A Simple Example
( normalization of Hub -2)
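The update order used in the example (authority scores first, then hub scores, each followed by normalization) can be sketched as below. The sketch assumes that normalization divides every score by the square root of the sum of squares; the function names `hits` and `normalize` and the small newspaper-style graph are hypothetical and do not reproduce the numbers on the slides.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: value / norm for page, value in scores.items()}


def hits(out_links, iterations=50):
    """Return (hub, authority) scores for a dict {page: [pages it links to]}."""
    pages = list(out_links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update first: sum the hub scores of pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)
        # Hub update: sum the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in out_links[page]) for page in pages}
        hub = normalize(hub)
    return hub, auth


# Hypothetical graph: three link pages pointing at two sites.
web = {
    "hub1": ["nyt", "ebay"],
    "hub2": ["nyt"],
    "hub3": ["nyt", "ebay"],
    "nyt": [],
    "ebay": [],
}
hub, auth = hits(web)
print({page: round(score, 3) for page, score in auth.items()})
print({page: round(score, 3) for page, score in hub.items()})
```

In this graph the page linked to by all three link pages ends up with the highest authority score, and the link pages that point at both sites get the highest hub scores.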