Social Network Analysis
Social Hub
Garvit Vijayvargia(19ucs201)
Tanush Garg(19ucs203)
Abhishek Kumar(19ucs241)
Abhinav Agarwal(19ucs254)
Course Instructor : Dr. Sakthi Balan
Department of Computer Science and Engineering
The LNM Institute of Information Technology, Jaipur
Round 1
Code - Drive Link
DATA SET 1
Twitter(ICWSM)
About dataset
ATTRIBUTES        VALUES
Node meaning      User
Edge meaning      Follow
Network format    Unipartite, directed
Edge type         Unweighted, no multiple edges
Reciprocal        Contains reciprocal edges
Directed cycles   Contains directed cycles
Loops             Does not contain loops
Dataset Statistics
ATTRIBUTES        VALUES
Size              N = 465,017
Volume            M = 834,797
Diameter          8 edges
Network Description
Twitter is a large social media platform where anyone can share their
opinions and follow anyone else, which results in a massive directed
network. The network records who follows whom on Twitter: nodes are users,
and there is a directed edge from a user to each user whom he/she follows.
Libraries Used
NetworkX
NetworkX is a Python library used for creating and manipulating complex
graphs and networks. Using NetworkX we can generate many types of
random and classic networks, analyze network structure, build network
models, design new network algorithms and draw networks.
Matplotlib
Matplotlib is also a Python library, used for data visualization and
graphical plotting. It is a good open-source alternative to MATLAB.
Problem Statement
Find all centrality measures, clustering coefficients (both local and global),
reciprocity and transitivity that we have studied in the class using appropriate
algorithms.
We first import the dataset and then create the graph from it.
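The loading step can be sketched as follows. This is a minimal sketch, assuming the dataset is a whitespace-separated edge list with one "follower followee" pair per line; the sample lines and the helper name build_follow_graph are hypothetical and stand in for the real file.

```python
import networkx as nx

# Hypothetical sample in "source target" form; the real dataset file would
# hold one "follower followee" pair per line.
SAMPLE_LINES = [
    "1 2",
    "2 1",
    "3 1",
    "3 2",
    "4 3",
]

def build_follow_graph(lines):
    """Parse whitespace-separated edge lines into a directed graph."""
    return nx.parse_edgelist(lines, nodetype=int, create_using=nx.DiGraph)

G = build_follow_graph(SAMPLE_LINES)
print(G.number_of_nodes(), G.number_of_edges())
```

For an undirected network such as Flixster, passing nx.Graph as create_using would be the corresponding choice.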
Calculating Degree Centrality
This measures the centrality of a node with respect to its edges.
IN DEGREE CENTRALITY:
OUT DEGREE CENTRALITY:
Here, incoming edges denote the followers of a user, and outgoing edges
denote the users that he/she follows (followees). The above values of degree
centrality are normalized by n-1 (n is the number of nodes). Therefore, we can
say that a user on average has 1.7951 followers and 1.795 followees. The user
corresponding to node 643 has the most followers, with value = 198.99, and
the user corresponding to node 3418 follows the most users, with
value = 499.99. The user corresponding to node 2555 has the fewest
followers, with value = 0, and the user corresponding to node 2 follows the
fewest users, with value = 0.
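The in-degree and out-degree centralities discussed above can be sketched on a toy graph. The edge list is hypothetical; the conversion back to raw counts multiplies by n-1, which is the normalization NetworkX applies.

```python
import networkx as nx

# Toy follow graph (hypothetical): an edge u -> v means "u follows v".
G = nx.DiGraph([(1, 2), (2, 1), (3, 1), (3, 2), (4, 3)])
n = G.number_of_nodes()

in_c = nx.in_degree_centrality(G)    # followers, normalized by n-1
out_c = nx.out_degree_centrality(G)  # followees, normalized by n-1

# Undo the normalization to recover raw follower / followee counts.
followers = {v: round(c * (n - 1)) for v, c in in_c.items()}
followees = {v: round(c * (n - 1)) for v, c in out_c.items()}

# The mean in-degree equals the mean out-degree: both are M / N.
mean_followers = sum(followers.values()) / n
mean_followees = sum(followees.values()) / n
print(followers, followees, mean_followers)
```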
Calculating Eigenvector Centrality
This centrality measures the importance of a node by the importance of its
neighboring nodes.
From the above data we can see that node 5898 has the maximum
eigenvector centrality, with value 0.04367, while node 2555 has the lowest
eigenvector centrality, with value = 6.8241*10^-17. The average eigenvector
centrality of the network is 0.00050278.
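A small sketch may help explain why some nodes end up with an eigenvector centrality that is essentially zero. The toy graph is hypothetical; for directed graphs NetworkX scores a node through its incoming edges, so a user whom nobody follows receives no importance from neighbors.

```python
import networkx as nx

# Toy follow graph (hypothetical): nodes 1-3 follow each other in a loop,
# node 4 follows node 1 but nobody follows node 4.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3), (4, 1)])

# Power iteration; on directed graphs the score is based on in-edges.
ec = nx.eigenvector_centrality(G, max_iter=1000, tol=1e-8)

top = max(ec, key=ec.get)
print(top, ec)
```

Node 4 has no incoming edges, so its score decays towards zero during the iteration, which is the same effect as the near-zero minimum reported above.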
Calculating Katz Centrality
Katz centrality biases the eigenvector centrality to avoid zero centrality
values for nodes with no incoming edges.
As per the above graph, the minimum Katz centrality is 1.438503*10^-3 at
node 2555, the maximum Katz centrality is 1.438503*10^-3 at node 643, and
the average Katz centrality of the network is 1.4655*10^-3.
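The bias term can be illustrated on a toy graph. This is a sketch, not the configuration used for the results above: the graph is hypothetical, and alpha = 0.1 and beta = 1.0 are assumed values (alpha must stay below the reciprocal of the largest eigenvalue of the adjacency matrix for the iteration to converge).

```python
import networkx as nx

# Same hypothetical toy graph: nobody follows node 4.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3), (4, 1)])

# beta gives every node a baseline score, so nodes without incoming
# edges no longer collapse to zero as they do for eigenvector centrality.
kc = nx.katz_centrality(G, alpha=0.1, beta=1.0, max_iter=1000, tol=1e-6)

lowest = min(kc, key=kc.get)
print(lowest, kc)
```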
Calculating Betweenness Centrality
This centrality is computed on the basis of how important a node is in
connecting the other nodes.
As per the above graph, the minimum betweenness centrality is 0.0 at node 2,
the maximum betweenness centrality is 3.65274*10^-2 at node 23719, and the
average betweenness centrality of the network is 4.8328*10^-8.
The node (user) with the maximum value of betweenness centrality has
the most influence on the flow through the system. This user is the most
central in connecting other users of the network and hence holds authority
over disparate clusters in the network, or lies on the periphery of most
clusters.
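The idea of a node that connects otherwise separate users can be sketched on a toy graph. The graph is hypothetical. The k argument samples source nodes and is one option for large networks; whether it was used for the results above is not stated, so it appears here only as an illustration.

```python
import networkx as nx

# Toy graph (hypothetical): every path from {1, 2} to {4, 5} runs through 3.
G = nx.DiGraph([(1, 3), (2, 3), (3, 4), (3, 5)])

# Exact betweenness, normalized by (n-1)(n-2) for directed graphs.
bc = nx.betweenness_centrality(G, normalized=True)

# Approximation from k sampled source nodes (cheaper on large graphs).
bc_approx = nx.betweenness_centrality(G, k=3, seed=42)

broker = max(bc, key=bc.get)
print(broker, bc[broker])
```

Node 3 lies on the 4 shortest paths between the other pairs, out of (n-1)(n-2) = 12 ordered pairs, which gives 4/12.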
Calculating Closeness Centrality
Closeness centrality is measured in terms of the average shortest path length between
a node and the other nodes.
The node (user) with the maximum closeness centrality value is best placed to
influence the entire network most quickly. This user can spread information to other
users of the network much faster than any other node and is "the broadcaster" of the
network.
As per the above graph, the minimum closeness centrality is 0.0 at node 255, the
maximum closeness centrality is 1.3035267*10^-3 at node 643, and the average closeness
centrality of the network is 5.5425*10^-4.
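A minimum of 0.0 is expected in a directed network, and a toy example shows why. The sketch assumes the default behavior of recent NetworkX versions, where closeness on a directed graph is computed from incoming distances and scaled by the fraction of nodes that can reach the node (the Wasserman-Faust correction). The graph is hypothetical.

```python
import networkx as nx

# Toy chain (hypothetical): 1 -> 2 -> 3.
G = nx.DiGraph([(1, 2), (2, 3)])

# Uses distances *to* each node; nodes nobody can reach score 0.0.
cc = nx.closeness_centrality(G)
print(cc)
```

Node 1 has no incoming paths and scores 0.0. Node 3 is reached by both other nodes at distances 1 and 2, so its score is 2 / (1 + 2).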
Calculating Clustering Coefficient:
The above graph shows the local clustering coefficient for all the nodes. The local
clustering coefficient measures how densely the neighbors of a given node are
connected to each other. In the network, the user represented by node has the
minimum clustering coefficient.
The average local clustering coefficient of the above network is 1.03936*10^-2.
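The local and average clustering coefficients can be sketched on a small undirected toy graph (hypothetical): a triangle with one pendant node attached.

```python
import networkx as nx

# Triangle 1-2-3 plus a pendant node 4 attached to node 3.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

local = nx.clustering(G)            # per-node coefficient
average = nx.average_clustering(G)  # mean of the local values

print(local, average)
```

Node 3 has three neighbors (1, 2, 4) and only one of the three possible pairs among them is connected, so its coefficient is 1/3. Node 4 has a single neighbor, so its coefficient is 0.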
Calculating Transitivity & Reciprocity
Transitivity of the graph is the clustering coefficient of the network at the global level
and it is equal to 0.0001325. This value tells us the tendency to form clusters
across the entire network.
Reciprocity is the fraction of reciprocal edges over all edges. It is 0.0030115 for this
network.
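Both global measures can be sketched on a toy follow graph. The graph is hypothetical, and computing transitivity on the undirected version of the graph is an assumption made here for simplicity; the report does not state how direction was handled.

```python
import networkx as nx

# Toy follow graph (hypothetical): only the pair 1 <-> 2 is mutual.
G = nx.DiGraph([(1, 2), (2, 1), (3, 1), (3, 2), (4, 3)])

# Fraction of directed edges whose reverse edge is also present.
rec = nx.reciprocity(G)

# Global clustering: 3 * triangles / connected triples, ignoring direction.
trans = nx.transitivity(G.to_undirected())

print(rec, trans)
```

Two of the five directed edges (1 -> 2 and 2 -> 1) are reciprocated, giving 0.4. The undirected version has one triangle and five connected triples, giving 3/5.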
DATA SET 2: Flixster
About DataSet:
[Link]
Data Set Description:
ATTRIBUTES        VALUES
Node meaning      User
Edge meaning      Friendship
Network format    Unipartite, undirected
Edge type         Unweighted, no multiple edges
Loops             Does not contain loops
Size              n = 2,523,386
Volume            m = 7,918,801
Loop count        0
About Network :
Flixster was an American social networking movie website for discovering new movies
and meeting others with similar tastes in movies. The site allowed users to see movie
trailers as well as learn about new and upcoming movies at the box office.
Libraries Used:
NetworkX :
NetworkX is a Python package for the creation, manipulation, and study of the
structure, dynamics, and functions of complex networks. With NetworkX, we can
load and store networks in standard and nonstandard data formats, generate many
types of random and classic networks, analyze network structure, build network
models, design new network algorithms, draw networks, and much more.
MatplotLib:
Library used for plotting graphs.
Centrality Measure Studied:
Calculating Degree Centrality
Degree centrality is defined as the number of links incident upon a node. If the
network is directed , then two separate measures of degree centrality are defined,
namely, indegree and outdegree. Indegree is a count of the number of ties directed
to the node and outdegree is the number of ties that the node directs to others. In
such cases, the degree is the sum of indegree and outdegree.
Measures centrality of nodes with respect to their degree.
Here, edges denote friendship among the users. The above values of degree
centrality are normalized by n-1 ≈ 2.52*10^6 (the maximum possible degree).
Multiplying back, a user has on average about 6 friends (2.49*10^-6 * 2.52*10^6 ≈ 6.3),
the user corresponding to node 67 has the most friends, around 1,460
(0.00058 * 2.52*10^6), and the user corresponding to node 45821 has the fewest
friends, about 1 (3.96*10^-7 * 2.52*10^6).
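As a sanity check, the conversion from normalized degree centrality back to a friend count can be written out. This is a sketch using n and m from the dataset table above and the centrality values quoted in the text; the helper name is hypothetical.

```python
# Values from the Flixster dataset table.
n = 2_523_386   # nodes
m = 7_918_801   # edges

def degree_from_centrality(c, n):
    """Undo the 1/(n-1) normalization of degree centrality."""
    return c * (n - 1)

avg_friends = degree_from_centrality(2.49e-6, n)   # average centrality
max_friends = degree_from_centrality(0.00058, n)   # node 67
min_friends = degree_from_centrality(3.96e-7, n)   # node 45821

# Independent check: the mean degree of an undirected graph is 2m / n.
mean_degree = 2 * m / n

print(round(avg_friends, 2), round(max_friends), round(min_friends))
print(round(mean_degree, 2))
```

The average recovered from the centrality agrees with 2m/n to within rounding of the quoted values, which suggests the normalization constant is of order 10^6.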
Calculating Katz Centrality
Katz centrality measures the relative degree of influence of a node within a
social network. It measures influence by taking into account the total number
of walks between a pair of nodes. It biases the eigenvector centrality to avoid
zero centrality values for nodes with no incoming edges.
The computation could not be completed because the run time was too long.
Calculating Betweenness Centrality
It is a measure of centrality based on shortest paths. For every pair of vertices in a
connected graph, there exists at least one shortest path between the vertices such
that either the number of edges that the path passes through (for unweighted
graphs) or the sum of the weights of the edges (for weighted graphs) is minimized.
The betweenness centrality of a vertex is the number of these shortest paths
that pass through the vertex. Centrality is thus computed on the basis of how
important a node is in connecting the other nodes.
The computation could not be completed because the run time was too long. The node
with the maximum value of betweenness centrality has the most influence on the
flow through the system. This user is the most central in connecting other users of
the network and hence holds authority over disparate clusters in the network.
Calculating Eigenvector Centrality
It is an algorithm that measures the transitive influence of nodes. Relationships
originating from high-scoring nodes contribute more to the score of a node than
connections from low-scoring nodes. A high eigenvector score means that a node is
connected to many nodes who themselves have high scores.
Centrality of the node is measured by using the importance of the neighboring
nodes.
We can see that node 38022 has the maximum eigenvector centrality while node
1504238 has the lowest eigenvector centrality. The average eigenvector centrality of
the network is 8.81*10^-5.
Calculating Closeness Centrality
In a connected graph, the closeness centrality of a node is calculated as the
reciprocal of the sum of the lengths of the shortest paths between the node and all
other nodes in the graph. Thus, the more central a node is, the closer it is to all
other nodes. Centrality is measured in terms of the average shortest path length
from the node.
The computation could not be completed because the run time was too long. The node
(user) with the maximum closeness centrality value is best placed to influence the
entire network most quickly. This user can spread information to other users of the
network much faster than any other node and is "the broadcaster" of the network.
Calculating Clustering Coefficient
In graph theory, a clustering coefficient is a measure of the degree to which nodes in
a graph tend to cluster together.
The above graph shows the local clustering coefficient for all the nodes. The local
clustering coefficient measures how densely the neighbors of a given node are
connected to each other. In the network, the user represented by node 49 has the
minimum clustering coefficient of 0.
The average local clustering coefficient of the above network is 0.0833.
Calculating Transitivity & Reciprocity
Transitivity of the graph is the clustering coefficient of the network at the global level
and it is equal to 0.01365. This value tells us the tendency to form clusters
across the entire network. Reciprocity is not studied in the case of an undirected
graph.
Inferences
● Reciprocity is not informative for the Flixster network: it is
undirected, so all connections are mutual by default. The Twitter
network is directed, so it has a measurable reciprocity value
(0.0030115).
● Flixster has higher transitivity than Twitter (0.01365 vs 0.0001325),
hence its users share more common friends.
● Flixster has a more compact network as its closeness centrality is
higher.
Round 2
Code - Drive Link
Problem-1:
Using the two datasets that you had selected for the first round and the
corresponding networks that you had generated from those two datasets do
the following:
1. Find the giant component G in both the networks (note that the giant
component is the largest connected subgraph of the constructed
network/graph). Let N_G denote the number of nodes in G. Find N_G/N
where N is the total number of nodes in the network.
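One way to produce such N_G/N values as the network grows is sketched below. It assumes edges are added in file order and the giant component is measured every fixed number of edges; for the directed Twitter network it assumes weakly connected components. The function names and the toy edge list are hypothetical.

```python
import networkx as nx

def giant_fraction(G):
    """Return (N_G, N_G / N) for the largest (weakly) connected component."""
    if G.is_directed():
        comps = nx.weakly_connected_components(G)
    else:
        comps = nx.connected_components(G)
    n_g = max((len(c) for c in comps), default=0)
    return n_g, n_g / G.number_of_nodes()

def growth_table(edges, step, directed=False):
    """Add edges one by one and record (E, N, N_G, N_G/N) every `step` edges."""
    G = nx.DiGraph() if directed else nx.Graph()
    rows = []
    for i, (u, v) in enumerate(edges, start=1):
        G.add_edge(u, v)
        if i % step == 0:
            n_g, frac = giant_fraction(G)
            rows.append((i, G.number_of_nodes(), n_g, frac))
    return rows

# Toy edge list (hypothetical), sampled every 2 edges.
rows = growth_table([(1, 2), (3, 4), (2, 3), (5, 6)], step=2)
for row in rows:
    print(row)
```

Recomputing the components from scratch at every checkpoint is simple but slow on millions of edges; a union-find structure would be a natural alternative.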
FLIXSTER:
RESULT:
EDGES (E) NODES (N) NODES IN GIANT COMPONENT (NG) NG/N
100000 121436 45012 0.37066438
600000 460843 412413 0.8949099
1100000 706900 665468 0.9413891
1600000 913154 877531 0.960989
2100000 1094699 1064427 0.9723467
2600000 1258665 1233120 0.9797046
3100000 1412518 1391245 0.9849396
3600000 1553821 1536464 0.9888294
4100000 1686293 1672121 0.9915957
4600000 1813140 1802183 0.9939568
5100000 1932747 1923966 0.9954567
5600000 2047314 2040141 0.9964963
6100000 2157254 2152267 0.9976882
6600000 2262252 2258709 0.9984338
7100000 2363449 2361464 0.9991601
7600000 2462666 2462004 0.9997311
TWITTER:
RESULT:
EDGES (E) NODES (N) NODES IN GIANT COMPONENT (NG) NG/N
100000 82390 80087 0.97204757
160000 123058 121887 0.99048416
220000 160946 160097 0.99472493
280000 196260 196019 0.99877203
340000 229808 229666 0.99938209
400000 261884 261497 0.99852224
460000 292858 292766 0.99968585
520000 322496 322470 0.99991937
580000 351841 351807 0.99990336
640000 379257 379233 0.99993671
700000 406582 406569 0.99996802
760000 432983 432980 0.9999930
820000 458697 458692 0.99998909
Problem-2:
Create a scale-free network using the Barabasi-Albert Model Algorithm (it
should contain not less than 10000 nodes). Apply ICM (Independent Cascade
Model) to find the number of steps required to get to the maximum number of
nodes. This you may repeat 10 times by starting from different nodes and see
how many steps are required for the above and summarize it with average
and median. Activation probabilities for the pair of nodes which are needed
for ICM can be assigned randomly. When you are assigning it randomly, note
this point: from a node say v if there are three edges to different vertices w,x,
and y. Then it should be p(v,w)+p(v,x)+p(v,y)=1.
Scale-free Network :
A scale-free network is one whose degree distribution follows a power
law (Pareto distribution), at least asymptotically: $p_k \sim k^{-\gamma}$,
where $p_k$ is the fraction of nodes with degree k.
A scale-free network revolves around the ideas of Growth and Preferential
Attachment:
● Growth indicates that there is always an increase in the number of nodes in
the network over time.
● Preferential Attachment suggests that a node existing in the network with a
higher degree has a higher probability of connecting to other new nodes
added in the network later. Higher degree nodes are seen to capture more
links than the ones with lesser degrees.
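Growth and preferential attachment can be sketched with the Barabasi-Albert generator in NetworkX. The choice of m = 2 edges per new node and the random seed are assumptions for illustration; the report only fixes the size at 10000 nodes.

```python
import networkx as nx

N_NODES = 10_000   # size required by the problem statement
M_EDGES = 2        # edges attached by each new node (assumed value)

# Each new node attaches to M_EDGES existing nodes, chosen with
# probability proportional to their current degree.
G = nx.barabasi_albert_graph(N_NODES, M_EDGES, seed=42)

degrees = [d for _, d in G.degree()]
mean_degree = sum(degrees) / len(degrees)
max_degree = max(degrees)

# Number of nodes at each degree: index k holds the count of degree-k nodes.
hist = nx.degree_histogram(G)

print(mean_degree, max_degree, hist[:6])
```

Most nodes keep a degree close to M_EDGES while a few early nodes become hubs with a degree far above the mean, which is the heavy tail expected from a power law.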
Solution:
Activation probability = 1 / (number of neighbors)
Drawing the scale-free network:
(for 10000 nodes)
Independent Cascade Model Implementation-
● Sender-centric model of cascade
● Each node has one chance to activate its neighbors
In ICM, nodes that are active are senders and nodes that are being activated
are receivers.
• A node activated at time 𝑡 has one chance, at time step 𝑡 + 1, to activate its
neighbors
• Assume 𝑣 is activated at time 𝑡. For any neighbor 𝑤 of 𝑣, there is a probability
𝑝𝑣𝑤 that node 𝑤 gets activated at time 𝑡 + 1.
• Node 𝑣 activated at time 𝑡 has a single chance of activating its neighbors.
The activation can only happen at 𝑡 + 1
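The rules above can be sketched as a short simulation. This is a minimal sketch, not the implementation behind the reported results: it assumes an activation probability of 1/(number of neighbors of the sender), a Barabasi-Albert graph with m = 2, single random start nodes, and fixed random seeds. Its output is not expected to match the figures in this report.

```python
import random
import statistics
import networkx as nx

def independent_cascade(G, seeds, rng):
    """Run one ICM cascade; return (number of active nodes, number of steps)."""
    active = set(seeds)
    frontier = list(seeds)          # nodes activated in the previous step
    steps = 0
    while frontier:
        newly_active = []
        for v in frontier:          # each sender gets exactly one chance
            nbrs = list(G.neighbors(v))
            if not nbrs:
                continue
            p = 1.0 / len(nbrs)     # probabilities out of v sum to 1
            for w in nbrs:
                if w not in active and rng.random() < p:
                    active.add(w)
                    newly_active.append(w)
        if newly_active:
            steps += 1
        frontier = newly_active
    return len(active), steps

G = nx.barabasi_albert_graph(10_000, 2, seed=7)
rng = random.Random(7)

# Repeat from 10 different start nodes and summarize the step counts.
starts = rng.sample(list(G.nodes()), 10)
results = [independent_cascade(G, [s], rng) for s in starts]
step_counts = [steps for _, steps in results]

mean_steps = statistics.mean(step_counts)
median_steps = statistics.median(step_counts)
print(results, mean_steps, median_steps)
```

Tracking only the most recently activated nodes as the frontier is what enforces the single-chance rule: a node that failed to activate a neighbor at step t + 1 never tries again.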
NOTE :: The output is the mean of the spread in the network.
Results for GRAPH: 1
Results for GRAPH: 2
It can be seen that the spread increases as the network size grows. This is
because, as the number of nodes increases, preferential attachment leads to
the emergence of central nodes (hubs).
ICM on Barabasi-Albert model
We can observe that:
1. The average number of activated nodes is: 8344
2. The median number of activated nodes is: 8356
Flixster:
With just 6 nodes as the seed set, we get a mean spread of
16, i.e. 1% of the total network, where the activation probability is 1/no. of
neighbors.
The result is expected, as the Barabasi-Albert algorithm has several known
limitations and is unable to completely simulate real-world networks.
Twitter:
Now consider our Twitter network, which is a real-life example of a scale-free
network. With only 6 nodes as the seed set and an activation probability of
1/no. of neighbors, the spread is 54,663 nodes, i.e. 11% of the network, which
is quite a remarkable result.