Graph Analytics with Greenplum and Apache MADlib

© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Pivotal Korea
HongDon Lee, Sr. Data Scientist
30th January 2019

Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
Graph Analytics: Why, Where to use, What, How

Everything is connected!
“Nothing ever exists entirely alone. Everything is in relation to everything else (緣起, dependent origination)”
- Buddha -
“Learn how to see: Everything is connected to everything else”
- Leonardo da Vinci -
“In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it”
- Goethe -

What a Small World!
“6 Degrees of Separation”
1967, Stanley Milgram, Small-world experiment
[ Figure: a chain of six links, numbered 1 to 6 ]

From Reductionism to Holism
● Reductionism: “Divide and Conquer”
vs.
● Holism: “Everything has to be understood in relation to the whole”

From Individual to Relation
● At the individual level: Who are you?
● Features: Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
● Time: 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, ..., 2019.01.30
[ Figure: data grid with Features and Time axes, viewed from a Cross-sectional Perspective or a Longitudinal Perspective ]

From Individual to Relation
● Features: Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
● Time: 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, ..., 2019.01.30
● Relation/ Connection: Family, Friends, Colleagues, Community, ...
[ Figure: data grid with Features and Time axes (Cross-sectional and Longitudinal Perspectives), extended with a Relation/ Connection axis ]
“Tell me who your friends are and I’ll tell you who you are”
- Mexican Proverb -

Graph Analytics, one of the Data Scientist’s knives
● Graph Analytics
● t-Test, ANOVA
● CNN, RNN, GAN
● Random Forest, XGBoost
● Bayesian Statistics
● Regression, Logistic Regression
● PCA, factor analysis
● Clustering
● Text Analysis, NLP
The right tool depends on the business problem and the data.

Network: Everywhere with Everything, All the time
● Example domains: MMO Role-Playing Game, Chemistry, Social Network, Epidemiology, Bank Risk, 1st Party Fraud, Manufacturing, Gene
* Image sources: www.researchgate.net, https://www.nature.com/articles/, http://www.netminer.com/community, Grandjean, M. (2016), https://cambridge-intelligence.com, www.infoglide.com, https://blog.trifinance.com

Use Cases - PageRank
● Measures the importance of a vertex in a graph by counting the number and quality of the links to that vertex
❏ Web Search
❏ Scientific impact of researchers
❏ Neuroscience
❏ Street and space usage
* Image from https://en.wikipedia.org/wiki/PageRank

Use Cases - Single Source Shortest Path
● Finds a path from a source vertex to every other vertex such that the sum of the weights of its constituent edges is minimized
❏ Vehicle routing/ navigation
❏ Degrees of separation in a social network
❏ Min-delay path in a telecommunications network
❏ Plant and facility layout
❏ VLSI (Very-Large-Scale Integration) design

Use Cases - Cyber-security by Graph model
● Use historical Windows event data to build historical graphs* of typical user behavior
• Which machines does the user log in to?
• Which machines does the user log in from?
• How often?
• In which order?
● Is this behavior typical?
• Is it typical for this user?
• Is it typical for someone in a particular department?
• Is it typical for someone in the user’s job role?
● Graph models are sensitive to direction, order, and frequency.
[ Figure: Typical Behavior vs. Anomalous Behavior, drawn as login graphs over the hosts 34.23.123.4, 34.23.123.51, 34.23.1.1, 34.23.0.1 and 34.23.2.8, with one node labeled “DB with financial information” ]
* Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.
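The idea of a behavioral baseline graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the user name, the host names and the `min_count` cutoff are hypothetical, and the sketch covers direction and frequency only, not the order of logins.

```python
from collections import Counter

def build_baseline(events):
    """Count how often each directed login edge (user, source, destination) occurs."""
    return Counter(events)

def is_typical(baseline, event, min_count=2):
    """Treat an event as typical if the same directed edge was seen at least min_count times.

    min_count is a hypothetical cutoff chosen for this sketch.
    """
    return baseline[event] >= min_count

# Hypothetical login history: (user, machine logged in from, machine logged in to)
history = [
    ("alice", "ws-01", "app-01"),
    ("alice", "ws-01", "app-01"),
    ("alice", "ws-01", "app-01"),
    ("alice", "app-01", "db-01"),
    ("alice", "app-01", "db-01"),
]
baseline = build_baseline(history)

typical = is_typical(baseline, ("alice", "ws-01", "app-01"))        # frequent edge
reversed_edge = is_typical(baseline, ("alice", "app-01", "ws-01"))  # same hosts, opposite direction
new_target = is_typical(baseline, ("alice", "ws-01", "fin-db"))     # destination never seen before
print(typical, reversed_edge, new_target)
```

Because the edge is stored as an ordered pair, a login in the reverse direction counts as a different edge, which is what makes the model direction-sensitive.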

Use Cases - Connected Component
● Calculate the Jaccard dissimilarity score for each pair of materials
● If materials X and Y are potential duplicates and materials Y and Z are potential duplicates, then X, Y and Z are a connected component in the graph of all materials and form a cluster
⇒ Connected component analysis identified 10% of materials as potential duplicates based on their bill of material attributes
[ Figure: materials X, Y and Z linked into one component ]
Features for each material
• part type
• material type & group
• product line & family
• revision key
• weld, material & coating specs
• quality matrix
• unit of measurement
• weight
• ...
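A minimal Python sketch of this duplicate-detection idea follows. The attribute sets, the material names and the 0.5 dissimilarity cutoff are made up for illustration; the use case itself does not state which threshold or attributes were used.

```python
from itertools import combinations

def jaccard_dissimilarity(a, b):
    """1 - |A intersect B| / |A union B| for two attribute sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def duplicate_clusters(materials, max_dissimilarity=0.5):
    """Link pairs below the cutoff, then return connected components with more than one member."""
    parent = {name: name for name in materials}

    def find(x):
        # Union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(materials, 2):
        if jaccard_dissimilarity(materials[a], materials[b]) <= max_dissimilarity:
            parent[find(a)] = find(b)

    clusters = {}
    for name in materials:
        clusters.setdefault(find(name), set()).add(name)
    return [members for members in clusters.values() if len(members) > 1]

# Hypothetical bill-of-material attributes
materials = {
    "X": {"bolt", "steel", "M8", "zinc"},
    "Y": {"bolt", "steel", "M8", "black-oxide"},
    "Z": {"bolt", "steel", "M10", "black-oxide"},
    "W": {"gasket", "rubber"},
}
clusters = duplicate_clusters(materials)
print(clusters)
```

In this toy data X and Z are not similar enough to be linked directly, but both are linked to Y, so all three end up in the same component.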

The Origin of Graph Theory
● Seven Bridges of Königsberg problem
● Leonhard Euler, a mathematician, proved that the problem has no solution
“The problem was to devise a walk through the city that would cross each of those bridges once and only once.”
Euler, 1753

[ Terminology of Graph theory ]
What is Graph Theory?
● Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.
● V: Vertex, also called Node, Point or Actor
● E: Edge, also called Link, Arc or Line
● An edge is either Directed or Undirected and may carry a Weight
[ Directed Network Graph with Weight (example): vertices 0 to 7, edge weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1 ]

Graph Algorithms and Measures
Types / Question / Graph-based Features
● Group
  Question: “What are the sub-graphs, components, communities?”
  Features: Weakly-connected component
● Structure
  Question: “What is the character of the network structure?”
  Features: Density, Diameter, Average path length, Modularity
● Centrality
  Question: “What are the most important vertices within a graph?”
  Features: Degree (in/out, weight), Closeness, PageRank, Hub, Authority, Betweenness, Clustering coefficient
● Path
  Question: “What is the shortest path (distance) among vertices?”
  Features: Single source shortest path, All pairs shortest path, Breadth-First Search

Graph Algorithms and Measures - (1) Group
Weakly-Connected Component
● A Connected Component (or just Component) of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph
● For a directed graph, a weakly connected component is a connected component of the same graph with the edge directions ignored
* source: Wikipedia
[ A supergraph with three connected components: Component 1, Component 2, Component 3 ]
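The definition can be illustrated with a short breadth-first search in Python. The eight vertices and five edges are an invented example with three components, and labeling each component by its smallest vertex id is a choice made for this sketch only.

```python
from collections import defaultdict, deque

def weakly_connected_components(vertices, edges):
    """Map each vertex to a component label, ignoring edge direction."""
    neighbours = defaultdict(set)
    for src, dest in edges:
        neighbours[src].add(dest)
        neighbours[dest].add(src)  # direction is ignored for weak connectivity

    component_of = {}
    for start in sorted(vertices):
        if start in component_of:
            continue
        component_of[start] = start  # label = smallest vertex id in the component
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbours[v]:
                if w not in component_of:
                    component_of[w] = start
                    queue.append(w)
    return component_of

# Hypothetical directed graph: {0, 1, 2}, {3, 4, 5, 6} and the isolated vertex 7
vertices = range(8)
edges = [(0, 1), (2, 1), (3, 4), (5, 4), (6, 5)]
components = weakly_connected_components(vertices, edges)
print(components)
```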

Graph Algorithms and Measures - (2) Structure
Density
● A dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague, and depends on the context.
* source: Wikipedia
❏ For Undirected simple graphs: $D = \frac{|E|}{|V|(|V|-1)/2}$
❏ For Directed simple graphs: $D = \frac{|E|}{|V|(|V|-1)}$
|E|: the number of Edges, |V|: the number of Vertices
[ Density by components (example) ]
$D = \frac{6}{4(4-1)/2} = 1$ and $D = \frac{3}{4(4-1)/2} = 0.5$
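The two density formulas translate directly into code. The sketch below is a plain Python helper (the function name is an assumption of this example, not a MADlib call) that reproduces the two component densities from the example.

```python
def density(num_vertices, num_edges, directed=False):
    """Edges present divided by the maximum possible number of edges in a simple graph."""
    if num_vertices < 2:
        raise ValueError("density is undefined for fewer than two vertices")
    max_edges = num_vertices * (num_vertices - 1)  # ordered pairs (directed case)
    if not directed:
        max_edges /= 2                             # unordered pairs (undirected case)
    return num_edges / max_edges

full = density(4, 6)                      # 4 vertices, all 6 possible undirected edges
half = density(4, 3)                      # 4 vertices, 3 undirected edges
as_directed = density(4, 6, directed=True)
print(full, half, as_directed)
```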

Graph Algorithms and Measures - (3) Path
Single Source Shortest Path (SSSP)
● Given a graph and a source vertex, the Single Source Shortest Path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the weights of the edges on the path is minimized.
[ Figure: directed, weighted example graph with vertices 0 to 7 ]
[ Shortest paths from vertex ‘0’ (example) ]
ID  weight      parent
0   0           0
1   1           0
2   1           0
3   2 (= 1+1)   2
4   10          0
5   2           2
6   3           5
7   4           6
* weight : The total weight of the shortest path from the source vertex to this particular vertex.
* parent : The parent of this vertex in the shortest path from source.
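The shortest-path table can be reproduced with a small Bellman-Ford sketch in Python. Two things are assumptions here: the edge list, which is not given as text and was chosen so that it matches the twelve edge weights of the example graph and the results shown in the table, and the choice of Bellman-Ford, picked because the graph contains a negative edge weight.

```python
def bellman_ford(num_vertices, edges, source):
    """Return (dist, parent) lists for shortest paths from source."""
    inf = float("inf")
    dist = [inf] * num_vertices
    parent = [None] * num_vertices
    dist[source] = 0
    parent[source] = source  # the table lists the source as its own parent

    for _ in range(num_vertices - 1):
        changed = False
        for src, dest, weight in edges:
            if dist[src] + weight < dist[dest]:
                dist[dest] = dist[src] + weight
                parent[dest] = src
                changed = True
        if not changed:
            break  # no edge could be relaxed, so the distances are final

    # One more pass: any further improvement means a negative-weight cycle
    for src, dest, weight in edges:
        if dist[src] + weight < dist[dest]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, parent

# Assumed edge list (src, dest, weight); not taken verbatim from the slides
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]
dist, parent = bellman_ford(8, EDGES, source=0)
for vertex in range(8):
    print(vertex, dist[vertex], parent[vertex])
```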

Graph Algorithms and Measures - (4) Centrality
In-Degree, Out-Degree
● The in-degree of a node is the number of edges pointing in to the node
● The out-degree of a node is the number of edges pointing out of the node
[ Figure: directed, weighted example graph with vertices 0 to 7 ]
[ The in-out degree for each node (example) ]
ID  In-degree  Out-degree
0   2          3
1   1          2
2   2          3
3   2          1
4   1          1
5   1          1
6   2          1
7   1          0
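Counting degrees needs only the edge list. In the sketch below the edge list is an assumption: it is not given as text, and was chosen because it reproduces the in-out degree table of the example.

```python
from collections import Counter

# Assumed directed edges (src, dest) of the example graph
EDGES = [
    (0, 1), (0, 2), (0, 4), (1, 2),
    (1, 3), (2, 3), (2, 5), (2, 6),
    (3, 0), (4, 0), (5, 6), (6, 7),
]

def in_out_degrees(num_vertices, edges):
    """Return {vertex: (in_degree, out_degree)} for vertices 0 .. num_vertices - 1."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dest for _, dest in edges)
    return {v: (in_degree[v], out_degree[v]) for v in range(num_vertices)}

degrees = in_out_degrees(8, EDGES)
for vertex, (indeg, outdeg) in degrees.items():
    print(vertex, indeg, outdeg)
```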

PageRank (1 / 2)
● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites
* source: Wikipedia
[ Figure: The size of each face is proportional to the total size of the other faces which are pointing to it. ]
- PR(A): PageRank of node A
- N: the total number of Nodes
- L(B): the number of Links from node B
- d: damping factor (probability, at any step, that a surfer will continue randomly clicking on links)
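The symbols PR(A), N, L(B) and d belong to a formula that is not present as text. A form consistent with the worked calculation in PageRank (2 / 2), where each term reads (1-0.85)/4 + 0.85*(...), is given below; the set M(A), the nodes that link to A, is notation introduced here for illustration.

```latex
PR(A) = \frac{1 - d}{N} + d \sum_{B \in M(A)} \frac{PR(B)}{L(B)}
```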

PageRank (2 / 2)
[ Figure: four nodes A, B, C, D, shown with their PageRank after each round ]
1st round (every node starts at 1/N = 0.25)
PR(A) = 0.25
PR(B) = 0.25
PR(C) = 0.25
PR(D) = 0.25
2nd round
PR(A) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/1 + 0.25/3) = 0.427
PR(B) = (1-0.85)/4 + 0.85*(0.25/3) = 0.108
PR(C) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/3) = 0.214
PR(D) = (1-0.85)/4 + 0.85*0 = 0.037
...
Final round (recursive calculation → converged)
PR(A) = 0.127
PR(B) = 0.048
PR(C) = 0.069
PR(D) = 0.038
* PR(A): PageRank of node A, N: the total number of Nodes, L(B): the number of Links from node B, d: damping factor (typically 0.85)
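The round-by-round calculation can be replayed with a few lines of Python. The link structure is inferred from the terms of the 2nd-round formulas (B links to A and C, C links to A, D links to A, B and C, and A has no outgoing links), so treat it as an assumption. The sketch applies the update rule literally and does not redistribute the rank of the link-less node A, which is why the converged values do not sum to 1; production implementations may handle such nodes differently and return different numbers.

```python
def pagerank_rounds(links, damping=0.85, max_iter=100, tol=1e-10):
    """Return the list of rank dictionaries, one per round, until convergence."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # 1st round: uniform start
    history = [rank]
    for _ in range(max_iter):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for src, targets in links.items():
            for dest in targets:
                # src passes an equal share of its rank to every node it links to
                new_rank[dest] += damping * rank[src] / len(targets)
        history.append(new_rank)
        converged = max(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return history

# Assumed link structure, inferred from the 2nd-round arithmetic
LINKS = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

history = pagerank_rounds(LINKS)
second_round = history[1]
final_round = history[-1]
print({v: round(r, 4) for v, r in second_round.items()})
print({v: round(r, 4) for v, r in final_round.items()})
```

The third decimal can differ by one from the figures above, which appear to be truncated rather than rounded in some places.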

Closeness
● Closeness of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph (i.e., based on All Pairs Shortest Path). For vertex 0 in the tables below: 1 / (0+1+1+2+10+2+3+4) = 1/23 ≈ 0.043
[ Figure: directed, weighted example graph with vertices 0 to 7 ]
[ All Pairs Shortest Path ]
source  destination  weight
0       0            0
0       1            1
0       2            1
0       3            2
0       4            10
0       5            2
0       6            3
0       7            4
1       0            4
1       1            0
1       2            2
1       3            3
1       4            14
1       5            3
1       6            4
1       7            5
2       0            2
...
[ Closeness Centrality ]
src_id  closeness
0       0.043
1       0.028
2       0.041
3       0.035
* N: The number of nodes, * d(y, x) : The distance between y and x node
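The two tables can be cross-checked with a Floyd-Warshall sketch in Python. The edge list is an assumption (it is not given as text and was chosen to reproduce the listed shortest-path weights), and closeness is computed as the reciprocal of the summed distances, which is what the listed values correspond to (1/23, 1/35, 1/24 and 1/28).

```python
INF = float("inf")

# Assumed edge list (src, dest, weight) of the example graph
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]

def all_pairs_shortest_paths(num_vertices, edges):
    """Floyd-Warshall: dist[i][j] is the total weight of the shortest path from i to j."""
    dist = [[0 if i == j else INF for j in range(num_vertices)]
            for i in range(num_vertices)]
    for src, dest, weight in edges:
        dist[src][dest] = min(dist[src][dest], weight)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def closeness(dist, vertex):
    """Reciprocal of the summed shortest-path weights to all reachable vertices."""
    total = sum(d for d in dist[vertex] if d != INF)
    return 1.0 / total if total > 0 else None  # undefined when the sum is not positive

dist = all_pairs_shortest_paths(8, EDGES)
scores = {v: closeness(dist, v) for v in range(4)}
for vertex, score in scores.items():
    print(vertex, round(score, 4))
```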

Big Issue of Graph Algorithms - High Complexity
* image source: https://www.xkcd.com/399/

Big Issue of Graph Algorithms - High Complexity
Type        Algorithms/ Measures          Time Complexity
Group       Weakly-Connected Component    O(|V| + |E|)
Structure   Density                       O(|V| + |E| log|E|)
Structure   Diameter                      O(|V|^3)
Path        All Pairs Shortest Path       O(|V|^3)
Path        Single Source Shortest Path   O(|V|^2)
Path        Breadth-First Search          O(|V| + |E|)
Centrality  In-Degree, Out-Degree         O(|V| + |E|)
Centrality  Closeness Centrality          O(|V|^3)
Centrality  PageRank                      O(log(network size)/(1 - damping factor))
Centrality  Betweenness Centrality        O(|V|^2 log|V| + |V||E|)
* |V|: the number of Vertices in graph
* |E|: the number of Edges in graph
● Computationally intensive
- cost grows polynomially, up to cubically, with the number of vertices and edges
⇒ Graph Analysis at Scale: parallel processing with MADlib on Greenplum

Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib

Tools for Graph Analytics
● Graph Analytics at Scale with Open Source MADlib on Greenplum
[ Figure: tools positioned by Data Size & Processing (Small Data vs. Big Data/ Parallel Processing) and by whether OSS or not (Commercial vs. Open Source) ]
● sna, igraph, ergm, network
● NetworkX, graph-tool, SNAP, pygraphviz, ...
● Interactive visualization focused tools
● Graph DB

Analytics Platform, GPDB
Graph Analytics at Scale
● Designed for very large graphs (billions of vertices/edges)
● No need to move and transform data for an external graph engine
- One analytics database to deploy and manage
● Familiar SQL interface
● Combine context-based graph analytics with other content-based techniques
❏ Advanced Analytics In Database: REGRESSION, CLASSIFICATION, CLUSTERING, GEOSPATIAL, GRAPH, TEXT, IMAGE
❏ Extended Language, GPText
❏ Scale Out
❏ MPP (Massively Parallel Processing) Architecture
[ Graph Analytics with MADlib on Greenplum ]

Apache MADlib: Scalable, In-Database Machine Learning
● Open source https://github.com/apache/madlib
● Downloads and docs http://madlib.apache.org/
● Wiki https://cwiki.apache.org/confluence/display/MADLIB/
Apache MADlib: Big Data Machine Learning in SQL
● Open source, top level Apache project
● For PostgreSQL and Greenplum Database
● Powerful machine learning, graph, statistics and analytics for data scientists

Apache MADlib: Functions
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
April 2018
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
Balanced/ Random/ Stratified Sampling

Apache MADlib: Graph Representation in MADlib
Vertex Table
Vertex  Vertex Params
0       ...
1       ...
2       ...
3       ...
...     ...
Edge Table
Source Vertex  Dest Vertex  Edge Weight  Edge Params
0              3            1.0          ...
1              0            5.0          ...
1              2            3.0          ...
2              3            8.0          ...
3              0            3.0          ...
3              1            2.0          ...
...            ...          ...          ...
[ Directed Graph (example): vertices 0 to 3, connected by the weighted edges listed in the Edge Table ]

Example: PageRank in MADlib
● Create vertex and edge tables to represent the graph
* http://madlib.apache.org/docs/latest/group__grp__pagerank.html
DROP TABLE IF EXISTS vertex;
CREATE TABLE vertex(
id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6);
DROP TABLE IF EXISTS edge;
CREATE TABLE edge(
src INTEGER,
dest INTEGER,
user_id INTEGER
)
DISTRIBUTED BY (user_id);
INSERT INTO edge VALUES
(0, 1, 1), (0, 2, 1), -- user id 1
(0, 4, 1), (1, 2, 1),
(1, 3, 1), (2, 3, 1),
(2, 5, 1), (2, 6, 1),
(3, 0, 1), (4, 0, 1),
(5, 6, 1), (6, 3, 1),
(0, 1, 2), (0, 2, 2), -- user id 2
(0, 4, 2), (1, 2, 2),
(1, 3, 2), (2, 3, 2),
(3, 0, 2), (4, 0, 2),
(5, 6, 2), (6, 3, 2);

● Compute the PageRank with All IDs
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_out' -- Output table of PageRank
, NULL); -- Damping factor (default 0.85)
SELECT * FROM pagerank_out
ORDER BY pagerank DESC;

● Network Diagram by Graphviz and PyGraphviz
* PyGraphviz is a Python interface to the Graphviz graph layout and visualization package
A vertex with a high PageRank is usually considered more "important" or more "influential" or more "relevant" than a vertex with a low PageRank.
[ Figure: network diagrams for User ID 1 and User ID 2; each node is labeled with its ID and (PageRank), and the size of a node is proportional to its PageRank value ]

● PageRank of vertices associated with each user by the grouping feature
DROP TABLE IF EXISTS pagerank_gr_out, pagerank_gr_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_gr_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, 0.00000001 -- Threshold
, 'user_id' -- Grouping column name
);
SELECT * FROM pagerank_gr_out
ORDER BY user_id, pagerank DESC;
[ Output: PageRank of user id 1 and PageRank of user id 2 ]

● Personalized PageRank of vertices {2, 4}, for Recommendations
DROP TABLE IF EXISTS pagerank_pers_out, pagerank_pers_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_pers_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, NULL -- Default threshold (1/(number of vertices * 1000))
, NULL -- No Grouping
, '{2, 4}' -- Personalization vertices
);
SELECT * FROM pagerank_pers_out
ORDER BY pagerank DESC;
SELECT * FROM pagerank_pers_out_summary;
* Personalized PageRank = (1-p)*Ax + p*E, where ‘E’ is the list of vertices for personalized PageRank and ‘p’ is the damping factor

PageRank Performance on Greenplum w/ MADlib
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
[ Figure: log-log plot; one axis runs from 1K to 100M, the other from 1s to 1M s, and the largest case is marked “5B edges” ]
* Note: log-log scale

In Summary
● Capture the Relationship in Networks using Graph Analytics
→ Community, Structure, Path, Centrality
→ Combine context-based graph analytics with other content-based insights
● Graph analytics at SCALE with Open Source Software
→ Apache MADlib on Greenplum, massively parallel processing

One more thing...
GREENPLUM SUMMIT at PostgresConf 2019
by Pivotal
