Twitter Social Networking Analysis

The document presents a comprehensive analysis of social networks using two datasets: Twitter and Flixster. It details various centrality measures, clustering coefficients, and network characteristics, employing Python libraries such as NetworkX and Matplotlib for calculations and visualizations. The analysis includes findings on degree centrality, eigenvector centrality, and the structure of giant components in both networks, along with a discussion on the scale-free network model and the Independent Cascade Model.

Uploaded by

Abhinav Agarwal

Social Network Analysis

Social Hub

Garvit Vijayvargia(19ucs201)
Tanush Garg(19ucs203)
Abhishek Kumar(19ucs241)
Abhinav Agarwal(19ucs254)

Course Instructor : Dr. Sakthi Balan

Department of Computer Science and Engineering


The LNM Institute of Information Technology, Jaipur
Round 1
Code - Drive Link
DATA SET 1
Twitter(ICWSM)

About dataset

ATTRIBUTES         VALUES
Node meaning       User
Edge meaning       Follow
Network format     Unipartite, directed
Edge type          Unweighted, no multiple edges
Reciprocal         Contains reciprocal edges
Directed cycles    Contains directed cycles
Loops              Does not contain loops

Dataset Statistics

ATTRIBUTES    VALUES
Size          N = 465,017
Volume        M = 834,797
Diameter      8 edges
Network Description

Twitter is a large social media platform where anyone can share opinions and
follow any other user, which produces a massive directed network. The dataset
records who follows whom on Twitter: each node is a user, and a directed edge
runs from a user to each user they follow.

Libraries Used

NetworkX

NetworkX is a Python library used for creating and manipulating complex
graphs and networks. Using NetworkX we can generate many types of
random and classic networks, analyze network structure, build network
models, design new network algorithms, and draw networks.

Matplotlib

Matplotlib is also a Python library, used for plotting. We use it for data
visualization and graphical plotting; it is a good open-source alternative
to MATLAB.
Problem Statement

Find all the centrality measures, clustering coefficients (both local and
global), reciprocity, and transitivity studied in class, using appropriate
algorithms.

We first import the dataset and then create the graph.
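
As a minimal sketch of this step, the follow relation can be read from an edge list into a directed graph. The in-memory sample below is a hypothetical stand-in for the downloaded file, and the "follower followee" line layout is an assumption about its format.

```python
import io
import networkx as nx

# Hypothetical sample standing in for the downloaded edge-list file;
# each line is assumed to read "follower followee".
sample = io.BytesIO(b"1 2\n2 1\n2 3\n3 1\n")

# create_using=nx.DiGraph keeps the edge direction (who follows whom).
G = nx.read_edgelist(sample, create_using=nx.DiGraph, nodetype=int)

print(G.number_of_nodes(), G.number_of_edges())
```

For the real dataset, the `io.BytesIO` object would be replaced by the path to the edge-list file.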
Calculating Degree Centrality

Degree centrality measures how central a node is by the number of edges
attached to it.

IN DEGREE CENTRALITY:

OUT DEGREE CENTRALITY:


Here, incoming edges denote the followers of a user, and outgoing edges
denote the users that user follows (followees). The degree centrality values
above are normalized by n-1 (n is the number of nodes). Converting back to
counts, a user has on average 1.7951 followers and 1.795 followees. The user
corresponding to node 643 has the most followers (value = 198.99), and the
user corresponding to node 3418 follows the most users (value = 499.99). The
user corresponding to node 2555 has the fewest followers (value = 0), and the
user corresponding to node 2 follows the fewest users (value = 0).
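
The in-degree and out-degree computation can be sketched on a toy graph as follows. The four-user graph is invented for illustration; `nx.in_degree_centrality` and `nx.out_degree_centrality` return values already divided by n-1.

```python
import networkx as nx

# Toy follow graph (illustrative only): an edge u -> v means u follows v.
G = nx.DiGraph([(1, 2), (3, 2), (4, 2), (2, 1)])
n = G.number_of_nodes()

in_c = nx.in_degree_centrality(G)    # followers / (n - 1)
out_c = nx.out_degree_centrality(G)  # followees / (n - 1)

# Multiply back by (n - 1) to recover raw follower counts.
followers = {v: round(c * (n - 1)) for v, c in in_c.items()}
most_followed = max(in_c, key=in_c.get)
print(most_followed, followers[most_followed])
```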
Calculating Eigenvector Centrality

This centrality measures a node's importance from the importance of its
neighboring nodes.

From the above data we can see that node 5898 has the maximum eigenvector
centrality, with value 0.04367, while node 2555 has the lowest eigenvector
centrality, with value 6.8241*10^-17. The average eigenvector centrality of
the network is 0.00050278.
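
A small sketch of the computation, assuming NetworkX's power-iteration routine `nx.eigenvector_centrality`. The toy graph is undirected for simplicity; on a directed graph NetworkX scores a node from its incoming edges.

```python
import networkx as nx

# Toy graph (illustrative only): a triangle 0-1-2 with a tail node 3 on node 2.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

# max_iter is raised as a precaution; the default is often enough.
ec = nx.eigenvector_centrality(G, max_iter=1000)

highest = max(ec, key=ec.get)
lowest = min(ec, key=ec.get)
average = sum(ec.values()) / len(ec)
print(highest, lowest, average)
```

Node 2 scores highest because it is connected to every other node, while the tail node 3 scores lowest.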
Calculating Katz Centrality

Katz centrality biases the eigenvector centrality to avoid zero centrality
values for nodes with no incoming edges.

As per the above graph, the minimum Katz centrality is 1.438503*10^-3 at
node 2555, the maximum Katz centrality is 1.438503*10^-3 at node 643, and
the average Katz centrality of the network is 1.4655*10^-3.
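
The bias term can be seen in a short sketch with `nx.katz_centrality`. The values `alpha=0.1` and `beta=1.0` are illustrative choices, not settings taken from this report; `alpha` must stay below the reciprocal of the largest eigenvalue of the adjacency matrix for the iteration to converge.

```python
import networkx as nx

# Toy directed graph (illustrative only); node 0 has no incoming edges.
G = nx.DiGraph([(0, 1), (0, 2), (1, 2), (2, 3)])

# beta is the constant added to every node, so even a node with no
# incoming edges ends up with a nonzero score.
katz = nx.katz_centrality(G, alpha=0.1, beta=1.0)

print(katz[0], katz[2])
```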
Calculating Betweenness Centrality

This centrality is computed on the basis of how important a node is in
connecting the other nodes.

As per the above graph, the minimum betweenness centrality is 0.0 at node 2,
the maximum betweenness centrality is 3.65274*10^-2 at node 23719, and the
average betweenness centrality of the network is 4.8328*10^-8.
The node (user) with the maximum betweenness centrality has the most
influence on the flow through the network. This user is the most central in
connecting other users and hence either holds authority over disparate
clusters in the network or sits on the periphery of most clusters.
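
To illustrate the bridging role, here is a sketch on a toy graph in which two triangles are joined through a single user. The graph is invented for illustration and is undirected for simplicity.

```python
import networkx as nx

# Two triangles {0, 1, 2} and {4, 5, 6} joined through the bridge node 3.
G = nx.Graph([(0, 1), (1, 2), (2, 0),
              (4, 5), (5, 6), (6, 4),
              (2, 3), (3, 4)])

# Normalized betweenness: the share of shortest paths passing through a node.
bc = nx.betweenness_centrality(G)

bridge = max(bc, key=bc.get)
print(bridge, bc[bridge])
```

Every shortest path between the two triangles passes through node 3, so it has the highest betweenness even though its degree is only 2.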
Calculating Closeness Centrality

Closeness centrality is measured in terms of the average shortest-path length
between a node and the other nodes.

The node (user) with the maximum closeness centrality is best placed to
influence the entire network most quickly. This user can spread information to
other users of the network faster than any other node and is the "broadcaster"
of the network.

As per the above graph, the minimum closeness centrality is 0.0 at node 255, the
maximum closeness centrality is 1.3035267*10^-3 at node 643, and the average
closeness centrality of the network is 5.5425*10^-4.
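
A sketch of how zero values can arise in a directed graph. NetworkX's `nx.closeness_centrality` uses incoming distances on a `DiGraph`, so a node that no one can reach scores 0. The toy graph is illustrative only.

```python
import networkx as nx

# Toy follow graph (illustrative only): edge u -> v means u follows v.
G = nx.DiGraph([(1, 0), (2, 0), (3, 0), (3, 1)])

# For directed graphs NetworkX measures distance *to* each node;
# call it on G.reverse() to use outgoing distance instead.
cc = nx.closeness_centrality(G)

print(cc)
```

Node 0 is reached by everyone in one step and scores 1.0, while nodes 2 and 3 have no incoming paths and score 0.0.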
Calculating Clustering Coefficient:

The above graph shows the local clustering coefficient for all the nodes. The local
clustering coefficient measures how densely the neighbors of a given node are
connected to one another. In
the network, the user represented by node has the minimum clustering coefficient.
The average local clustering coefficient of the above network is 1.03936*10^-2.
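
The local and average clustering coefficients can be sketched on a toy undirected graph with `nx.clustering` and `nx.average_clustering`. The graph is invented for illustration.

```python
import networkx as nx

# Toy graph (illustrative only): a triangle 0-1-2 with a tail node 3 on node 2.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

local = nx.clustering(G)            # one coefficient per node
average = nx.average_clustering(G)  # mean of the local coefficients

print(local, average)
```

Node 2 has three neighbors of which only one pair is connected, giving 1/3. Node 3 has a single neighbor, so its coefficient is 0.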
Calculating Transitivity & Reciprocity

Transitivity of the graph is the clustering coefficient of the network at the global
level, and it is equal to 0.0001325. This value tells us the tendency to form clusters
across the entire network.

Reciprocity is the fraction of reciprocal edges over all edges. It is 0.0030115 for
this network.
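
Both quantities can be sketched on a toy directed graph. Computing transitivity on the undirected version of the graph is an assumption made here for simplicity, not necessarily what was done for the figures above.

```python
import networkx as nx

# Toy follow graph (illustrative only); 1 and 2 follow each other.
G = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 1), (4, 1)])

# Reciprocity: directed edges whose reverse edge also exists / all edges.
recip = nx.reciprocity(G)

# Transitivity: 3 * triangles / connected triples, ignoring direction here.
trans = nx.transitivity(G.to_undirected())

print(recip, trans)
```

Two of the five directed edges belong to a mutual pair, so reciprocity is 0.4. The undirected graph has one triangle and five connected triples, so transitivity is 3/5.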
DATA SET 2: Flixster
About DataSet:

[Link]

Data Set Description:

ATTRIBUTES        VALUES
Node meaning      User
Edge meaning      Friendship
Network format    Unipartite, undirected
Edge type         Unweighted, no multiple edges
Loops             Does not contain loops
Size              n = 2,523,386
Volume            m = 7,918,801
Loop count        0
About Network :

Flixster was an American social-networking movie website for discovering new
movies and meeting others with similar tastes in movies. The site allowed users
to watch movie trailers and learn about new and upcoming movies at the box office.

Libraries Used:

NetworkX :

NetworkX is a Python package for the creation, manipulation, and study of the
structure, dynamics, and functions of complex networks. With NetworkX, we can
load and store networks in standard and nonstandard data formats, generate many
types of random and classic networks, analyze network structure, build network
models, design new network algorithms, draw networks, and much more.

Matplotlib:
Library used for plotting graphs.
Centrality Measure Studied:

Calculating Degree Centrality

Degree centrality is defined as the number of links incident upon a node. If the
network is directed, then two separate measures of degree centrality are defined,
namely indegree and outdegree. Indegree is a count of the number of ties directed
to the node, and outdegree is the number of ties that the node directs to others. In
such cases, the degree is the sum of indegree and outdegree.
Here the network is undirected, so centrality is measured by the degree of each node,
and edges denote friendship among the users. The above values of degree
centrality are normalized by n-1 ≈ 2.52*10^6 (the maximum possible degree). Therefore,
a user has on average about six friends (2.49*10^-6 * 2.52*10^6 ≈ 6.3). The user
corresponding to node 67 has the most friends, around 1,462 (0.00058 * 2.52*10^6),
while the user corresponding to node 45821 has the fewest, 1
(3.96*10^-7 * 2.52*10^6 ≈ 1).
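
The conversion between normalized degree centrality and friend counts can be sketched with plain arithmetic, taking n and m from the dataset table. The helper name `to_friend_count` is hypothetical.

```python
n = 2_523_386  # nodes, from the dataset table
m = 7_918_801  # edges, from the dataset table

# Each undirected edge adds 1 to the degree of two nodes.
avg_degree = 2 * m / n
avg_centrality = avg_degree / (n - 1)

def to_friend_count(centrality: float) -> float:
    """Undo the (n - 1) normalization to get a raw degree."""
    return centrality * (n - 1)

print(round(avg_degree, 2), f"{avg_centrality:.3g}")
```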

Calculating Katz Centrality

Katz centrality of a node is a measure of centrality in a network. It is used to
measure the relative degree of influence of a node within a social network. Katz
centrality measures influence by taking into account the total number of walks
between a pair of nodes. It biases the eigenvector centrality to avoid zero
centrality values for nodes with no incoming edges.

The computation could not be completed because the run time was too long.
Calculating Betweenness Centrality

It is a measure of centrality based on shortest paths. For every pair of vertices in a
connected graph, there exists at least one shortest path between the vertices such
that either the number of edges that the path passes through (for unweighted
graphs) or the sum of the weights of the edges (for weighted graphs) is minimized.
The betweenness centrality of a vertex is the number of these shortest paths
that pass through the vertex. Centrality is thus computed on the basis of how
important a node is in connecting the other nodes.

The computation could not be completed because the run time was too long. The node
with the maximum betweenness centrality has the most influence on the
flow through the network. This user is the most central in connecting other users
of the network and hence holds authority over disparate clusters in the network.
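
One way to cut the run time, not used in this report, is to estimate betweenness from a sample of source nodes through the `k` parameter of `nx.betweenness_centrality`. The sketch below uses a small generated graph as a stand-in; the values of `k` and `seed` are illustrative assumptions.

```python
import networkx as nx

# Stand-in graph (not the Flixster data): 300 nodes, preferential attachment.
G = nx.barabasi_albert_graph(n=300, m=2, seed=7)

# Shortest paths are computed from only k sampled source nodes, so the
# result is an estimate rather than the exact betweenness.
approx_bc = nx.betweenness_centrality(G, k=50, seed=7)

top5 = sorted(approx_bc, key=approx_bc.get, reverse=True)[:5]
print(top5)
```

A larger `k` gives a better estimate at a higher cost; whether any feasible `k` is accurate enough for a graph of this size would need to be checked.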
Calculating Eigenvector Centrality

It is an algorithm that measures the transitive influence of nodes. Relationships
originating from high-scoring nodes contribute more to the score of a node than
connections from low-scoring nodes. A high eigenvector score means that a node is
connected to many nodes that themselves have high scores.
The centrality of a node is measured using the importance of its neighboring
nodes.

We can see that node 38022 has the maximum eigenvector centrality while node
1504238 has the lowest eigenvector centrality. The average eigenvector centrality of
the network is 8.81*10^-5.
Calculating Closeness Centrality

In a connected graph, the closeness centrality of a node is calculated as the
reciprocal of the sum of the lengths of the shortest paths
between the node and all other nodes in the graph. Thus, the more central a node is,
the closer it is to all other nodes. Centrality is measured in terms of the average
shortest-path length of the nodes.

The computation could not be completed because the run time was too long. The node
(user) with the maximum closeness centrality is best placed to influence the
entire network most quickly. This user can spread information to other users of the
network faster than any other node and is the "broadcaster" of the network.
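
When computing closeness for every node is too slow, one option (an assumption, not something done in this report) is to compute it for a random sample of nodes through the `u` parameter of `nx.closeness_centrality`. The graph and sample size below are illustrative stand-ins.

```python
import random
import networkx as nx

# Stand-in graph (not the Flixster data).
G = nx.barabasi_albert_graph(n=300, m=2, seed=7)

rng = random.Random(7)
sample_nodes = rng.sample(sorted(G.nodes), 20)

# One breadth-first search per sampled node instead of one per node.
closeness = {u: nx.closeness_centrality(G, u=u) for u in sample_nodes}

print(max(closeness, key=closeness.get))
```

This gives exact values for the sampled nodes only, so it can estimate the average but may miss the true maximum.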
Calculating Clustering Coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in
a graph tend to cluster together.

The above graph shows the local clustering coefficient for all the nodes. The local
clustering coefficient measures how densely the neighbors of a given node are
connected to one another. In
the network, the user represented by node 49 has the minimum clustering
coefficient of 0.
The average local clustering coefficient of the above network is 0.0833.
Calculating Transitivity & Reciprocity

Transitivity of the graph is the clustering coefficient of the network at the global
level, and it is equal to 0.01365. This value tells us the tendency to form clusters
across the entire network. Reciprocity is not studied in the case of an undirected
graph.

Inferences
● Reciprocity is reported as 0 for the Flixster network because it is an
undirected network: all connections are mutual by default, so the
measure does not apply. The Twitter network is directed and therefore
has a nonzero reciprocity value (0.0030115).
● Flixster has higher transitivity than Twitter (0.01365 vs. 0.0001325),
so its users share more common friends.
● Flixster is a more compact network, as its closeness centrality is
higher.
Round 2
Code - Drive Link
Problem-1:
Using the two datasets that you had selected for the first round and the
corresponding networks that you had generated from those two datasets do
the following:

1. Find the giant component G in both the networks (note that the giant
component is the largest connected subgraph of the constructed
network/graph). Let N_G denote the number of nodes in G. Find N_G/N,
where N is the total number of nodes in the network.
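
A minimal sketch of this computation with NetworkX. Using weakly connected components for a directed graph such as Twitter is an assumption made here; the helper name `giant_component_fraction` is hypothetical.

```python
import networkx as nx

def giant_component_fraction(G):
    """Return (N_G, N_G / N) for the largest connected component of G."""
    if G.is_directed():
        # Assumption: edge direction is ignored for directed graphs.
        components = nx.weakly_connected_components(G)
    else:
        components = nx.connected_components(G)
    n_giant = max(len(c) for c in components)
    return n_giant, n_giant / G.number_of_nodes()

# Toy graph (illustrative only): one component of 4 nodes, one of 2.
G = nx.Graph([(1, 2), (2, 3), (3, 4), (5, 6)])
result = giant_component_fraction(G)
print(result)
```

Calling the function on graphs built from the first E edges of a dataset, for increasing E, would produce rows of the kind reported in the result tables.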

FLIXSTER:
RESULT:
EDGES (E)   NODES (N)   NODES IN GIANT COMPONENT (NG)   NG/N
100000      121436      45012                           0.37066438
600000      460843      412413                          0.8949099
1100000     706900      665468                          0.9413891
1600000     913154      877531                          0.960989
2100000     1094699     1064427                         0.9723467
2600000     1258665     1233120                         0.9797046
3100000     1412518     1391245                         0.9849396
3600000     1553821     1536464                         0.9888294
4100000     1686293     1672121                         0.9915957
4600000     1813140     1802183                         0.9939568
5100000     1932747     1923966                         0.9954567
5600000     2047314     2040141                         0.9964963
6100000     2157254     2152267                         0.9976882
6600000     2262252     2258709                         0.9984338
7100000     2363449     2361464                         0.9991601
7600000     2462666     2462004                         0.9997311

TWITTER:
RESULT:
EDGES (E)   NODES (N)   NODES IN GIANT COMPONENT (NG)   NG/N
100000      82390       80087                           0.97204757
160000      123058      121887                          0.99048416
220000      160946      160097                          0.99472493
280000      196260      196019                          0.99877203
340000      229808      229666                          0.99938209
400000      261884      261497                          0.99852224
460000      292858      292766                          0.99968585
520000      322496      322470                          0.99991937
580000      351841      351807                          0.99990336
640000      379257      379233                          0.99993671
700000      406582      406569                          0.99996802
760000      432983      432980                          0.9999930
820000      458697      458692                          0.99998909


Problem-2:
Create a scale-free network using the Barabasi-Albert Model Algorithm (it
should contain not less than 10000 nodes). Apply ICM (Independent Cascade
Model) to find the number of steps required to reach the maximum number of
nodes. Repeat this 10 times, starting from different nodes, see how many
steps are required each time, and summarize the results with the average
and median. The activation probabilities for the pairs of nodes, which are
needed for ICM, can be assigned randomly. When assigning them randomly, note
this point: if a node v has three edges to different vertices w, x,
and y, then it should hold that p(v,w)+p(v,x)+p(v,y)=1.

Scale-free Network :

A scale-free network is one whose degree distribution follows the Power
Law or Pareto Distribution, at least asymptotically: p_k ∼ k^{-γ}

A scale-free network revolves around the ideas of Growth and Preferential
Attachment.

● Growth indicates that there is always an increase in the number of nodes in
the network over time.

● Preferential Attachment suggests that a node existing in the network with a
higher degree has a higher probability of connecting to other new nodes
added in the network later. Higher degree nodes are seen to capture more
links than the ones with lesser degrees.

Solution:

Activation probability p(v,w) = 1 / (number of neighbors of v)
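
A sketch of how the network and the activation probabilities could be set up with NetworkX. The attachment parameter `m=2` and the `seed` are illustrative assumptions, not values stated in this report.

```python
import networkx as nx

# Barabasi-Albert graph with 10000 nodes; m and seed are illustrative.
G = nx.barabasi_albert_graph(n=10_000, m=2, seed=42)

# p[v][w] = 1 / deg(v), so the probabilities on the edges leaving v sum to 1.
p = {v: {w: 1 / G.degree(v) for w in G[v]} for v in G}

print(G.number_of_nodes(), G.number_of_edges())
```

With this choice every node spreads one unit of probability evenly over its neighbors, which satisfies the constraint p(v,w)+p(v,x)+p(v,y)=1 from the problem statement.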


Drawing the scale free network:
( For 10000 nodes)

Independent Cascade Model Implementation

● Sender-centric model of cascade
● Each node has one chance to activate its neighbors

In ICM, nodes that are active are senders and nodes that are being activated
are receivers.

• A node activated at time t has a single chance, at time step t + 1, to
activate its neighbors.

• Assume v is activated at time t. For any neighbor w of v, there is a
probability p(v,w) that node w gets activated at time t + 1.
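
The rules above can be sketched as a short simulation. This is a minimal illustration rather than the implementation used for the results below: the function name `independent_cascade`, the graph size, and the seeds are assumptions, and the activation probability is taken as 1/deg(v).

```python
import random
import statistics
import networkx as nx

def independent_cascade(G, seeds, rng):
    """Run one ICM cascade; return (activated nodes, number of steps)."""
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    steps = 0
    while frontier:
        newly_active = []
        for v in frontier:  # each node tries exactly once
            p = 1 / G.degree(v)
            for w in G[v]:
                if w not in active and rng.random() < p:
                    active.add(w)
                    newly_active.append(w)
        if newly_active:
            steps += 1  # count only steps that activated someone
        frontier = newly_active
    return active, steps

rng = random.Random(0)
G = nx.barabasi_albert_graph(n=1_000, m=2, seed=0)  # small stand-in graph

# Repeat from 10 different start nodes and summarize the step counts.
starts = rng.sample(sorted(G.nodes), 10)
results = [independent_cascade(G, [s], rng) for s in starts]
step_counts = [steps for _, steps in results]
print(statistics.mean(step_counts), statistics.median(step_counts))
```

Because only the nodes activated in the previous step are placed in `frontier`, each node gets exactly one chance to activate its neighbors, as the model requires.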

NOTE: The output is the mean of the spread in the network.


Results for GRAPH: 1

Results for GRAPH: 2

It can be seen that the spread increases as the network size grows. This is
because, as the number of nodes increases, preferential attachment leads to
the emergence of central (hub) nodes.
ICM on Albert-Barabasi model

We can observe that:

1. The average number of activated nodes is 8344.

2. The median number of activated nodes is 8356.

Flixster:
With just 6 nodes as the seed set, we get a mean spread of
16, i.e. 1% of the total network, where the activation probability is
1/no. of neighbors.

The difference from the generated network is expected, as the Barabasi-Albert
model has several limitations and cannot completely simulate real-world
networks.

Twitter:

Now consider our Twitter network, which is a real-life example of a
scale-free network. With only 6 nodes as the seed set and an activation
probability of 1/no. of neighbors, the spread is 54,663 nodes, i.e. 11% of
the network, which is quite a remarkable result.
