© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Graph Analytics
with Greenplum and Apache MADlib
Pivotal Korea
HongDon Lee, Sr. Data Scientist
30th January 2019
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
Graph Analytics: Why, Where to use, What, How
Everything is connected!
“Nothing ever exists entirely alone. Everything is in relation to everything else (緣起, dependent origination)”
- Buddha -
“Learn how to see: Everything is connected to everything else”
- Leonardo da Vinci -
“In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it”
- Goethe -
What a Small World!
“6 Degrees of Separation”
Stanley Milgram, small-world experiment (1967)
From Reductionism to Holism
Reductionism: “Divide and Conquer”
vs.
Holism: “Everything has to be understood in relation to the whole”
From Individual to Relation
Who are you? At the individual level:
Features: Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
Time: 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, 2019.01.30, ...
Cross-sectional Perspective, Longitudinal Perspective
From Individual to Relation
Features: Demographics, Behaviors, Preferences, Economic Status, Education Background, ...
Time: 2019.01.01, 2019.01.02, 2019.01.03, 2019.01.04, 2019.01.05, 2019.01.06, 2019.01.30, ...
Cross-sectional Perspective, Longitudinal Perspective
Relation/ Connection: Family, Friends, Colleagues, Community, ...
“Tell me who your friends are and I’ll tell you who you are”
- Mexican Proverb -
Graph Analytics, one of the Data Scientist’s knives
● Graph Analytics
● t-Test, ANOVA
● CNN, RNN, GAN
● Random Forest, XGBoost
● Bayesian Statistics
● Regression, Logistic Regression
● PCA, factor analysis
● Clustering
● Text Analysis, NLP
Which one to use depends on the business problem and the data
Network: Everywhere with Everything, All the time
Example domains: MMO Role-Playing Game, Chemistry, Social Network, Epidemiology, Bank Risk, 1st Party Fraud, Manufacturing, Gene
* Image sources: www.researchgate.net, https://www.nature.com/articles/, http://www.netminer.com/community, Grandjean, M. (2016), https://cambridge-intelligence.com, www.infoglide.com, https://blog.trifinance.com
Use Cases - PageRank
● Measures the importance of a vertex in a graph by counting the number and
quality of the links to that vertex
❏ Web Search
❏ Scientific impact of researchers
❏ Neuroscience
❏ Street and space usage
* Image from https://en.wikipedia.org/wiki/PageRank
Use Cases - Single Source Shortest Path
● Find a path to every vertex so that the sum of the weights of its constituent
edges is minimized
❏ Vehicle routing/ navigation
❏ Degrees of separation in a social network
❏ Minimum-delay path in a telecommunications network
❏ Plant and facility layout
❏ VLSI (Very-Large-Scale Integration) design
Use Cases - Cyber-security by Graph model
● Using historical window events data to build
historical graphs* of typical user behavior
• Which machines does the user log in to?
• Which machines does the user log in from?
• How often?
• In which order?
● Is this behavior typical?
• Is it typical for this user?
• Is it typical for someone in a particular department?
• Is this typical for someone in the user’s job role?
● Graph models are sensitive to direction, order,
and frequency.
[ Figure: Typical Behavior vs. Anomalous Behavior. Two login graphs over hosts 34.23.123.4, 34.23.123.51, 34.23.1.1, 34.23.0.1 and 34.23.2.8, plus a DB with financial information ]
*Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.
Use Cases - Connected Component
● Calculate the Jaccard dissimilarity score for each pair of materials
● If materials X and Y are potential duplicates and materials Y and Z are potential duplicates, then X, Y and Z belong to one connected component in the graph of all materials and form a cluster
⇒ Connected component analysis identified 10% of materials as potential duplicates based on their bill of material attributes
[ Figure: materials X, Y and Z linked into one component ]
Features for each material
• part type
• material type & group
• product line & family
• revision key
• weld, material & coating specs
• quality matrix
• unit of measurement
• weight
• ...
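The duplicate-detection idea above can be sketched in a few lines of Python. This is only an illustration: the material names, attribute sets and the 0.5 threshold are made up, and the pairwise loop plus union-find is one simple way to get connected components, not the method used in the original project.

```python
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two attribute sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def duplicate_clusters(materials, threshold):
    """Link pairs at or below the threshold, then return components with 2+ members."""
    parent = {name: name for name in materials}

    def find(name):
        # Follow parent links to the component root (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(materials, 2):
        if jaccard_dissimilarity(materials[x], materials[y]) <= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for name in materials:
        clusters.setdefault(find(name), set()).add(name)
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical attribute sets, for illustration only.
materials = {
    "X": {"bolt", "steel", "M8", "zinc"},
    "Y": {"bolt", "steel", "M8", "black-oxide"},
    "Z": {"bolt", "steel", "M10", "black-oxide"},
    "W": {"gasket", "rubber"},
}
clusters = duplicate_clusters(materials, threshold=0.5)
print(clusters)
```

Here X and Z are not similar enough to be linked directly, but both are linked to Y, so all three end up in the same component.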
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
Graph Analytics: Why, Where to use, What, How
The Origin of Graph Theory
● Seven Bridges of Königsberg problem
● Leonhard Euler, a mathematician, proved that the problem has no solution
“The problem was to devise a walk through the city that would cross each of those bridges once and only once.”
Euler, 1753
What is Graph Theory?
● Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.
[ Terminology of Graph theory ]
● V (vertex): also called Node, Point, Actor
● E (edge): also called Link, Arc, Line; Directed or Undirected
● Weight: a value attached to an edge (e.g., 10)
[ Directed Network Graph with Weight (example) ]: vertices 0 to 7 joined by 12 directed edges with weights 1, 2, 10, 1, 10, 1, 1, 3, 1, -2, 1, 1
Graph Algorithms and Measures
Type | Question | Measures
Group | “What are the sub-graphs, components, communities?” | Weakly-connected component
Structure | “What is the character of the network structure?” | Density, Diameter, Average path length, Modularity
Centrality | “What are the most important vertices within a graph?” | Degree (in/out, weight), Closeness, PageRank, Hub, Authority, Betweenness, Clustering coefficient
Path | “What is the shortest path (distance) among vertices?” | Single source shortest path, All pairs shortest path, Breadth-First Search
⇒ Graph-based Features
Graph Algorithms and Measures - (1) Group
Weakly-Connected Component
● A Connected Component (or just Component) of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph
* source: Wikipedia
● For a directed graph, the weakly connected components are the connected components obtained when edge direction is ignored
[ Figure: a supergraph with three connected components (Component 1, Component 2, Component 3) ]
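A minimal sketch of finding weakly connected components with breadth-first search, treating every directed edge as undirected. The vertex ids and edges are hypothetical, chosen so that three components come out as in the figure.

```python
from collections import defaultdict, deque


def weakly_connected_components(vertices, edges):
    """Return a list of vertex sets, one per weakly connected component."""
    neighbors = defaultdict(set)
    for src, dest in edges:
        # Ignore direction: record the edge both ways.
        neighbors[src].add(dest)
        neighbors[dest].add(src)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            v = queue.popleft()
            component.add(v)
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components


# Hypothetical directed graph with three weakly connected components.
vertices = range(1, 9)
edges = [(1, 2), (3, 2), (4, 5), (5, 6), (6, 4), (8, 7)]
components = weakly_connected_components(vertices, edges)
print(components)
```

Each vertex and edge is visited a constant number of times, which matches the O(|V| + |E|) cost usually quoted for this measure.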
Graph Algorithms and Measures - (2) Structure
Density
● A dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague, and depends on the context.
* source: Wikipedia
❏ For undirected simple graphs: $D = \frac{|E|}{|V|(|V|-1)/2}$
❏ For directed simple graphs: $D = \frac{|E|}{|V|(|V|-1)}$
|E|: the number of edges, |V|: the number of vertices
[ Density by components (example) ]
$D = \frac{6}{4(4-1)/2} = 1$ and $D = \frac{3}{4(4-1)/2} = 0.5$
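The two density formulas translate directly into code. The function name and its `directed` flag are illustrative choices, not part of any library.

```python
def density(num_vertices, num_edges, directed=False):
    """Edges present divided by the maximum possible edges in a simple graph."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    max_edges = num_vertices * (num_vertices - 1)
    if not directed:
        # An undirected pair counts once, so halve the ordered-pair count.
        max_edges /= 2
    return num_edges / max_edges


full = density(4, 6)   # the complete 4-vertex component from the example
half = density(4, 3)   # the 4-vertex component with 3 edges
print(full, half)
```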
Graph Algorithms and Measures - (3) Path
Single Source Shortest Path (SSSP)
● Given a graph and a source vertex, the Single Source Shortest Path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the weights of the path is minimized.
[ Figure: directed weighted graph with vertices 0 to 7 ]
[ Shortest paths from vertex ‘0’ (example) ]
ID | weight | parent
0 | 0 | 0
1 | 1 | 0
2 | 1 | 0
3 | 2 (= 1+1) | 2
4 | 10 | 0
5 | 2 | 2
6 | 3 | 5
7 | 4 | 6
* weight : The total weight of the shortest path from the source vertex to this particular vertex.
* parent : The parent of this vertex in the shortest path from source.
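A sketch of SSSP using Bellman-Ford, chosen here because the example graph contains a negative weight (-2). The edge list is an assumption: it is reconstructed so that it reproduces the weight, parent and degree tables in these slides, since the figure itself is not recoverable.

```python
def bellman_ford(num_vertices, edges, source):
    """Return (dist, parent) lists for shortest paths from source."""
    inf = float("inf")
    dist = [inf] * num_vertices
    parent = [None] * num_vertices
    dist[source] = 0
    parent[source] = source

    # Relax every edge up to |V| - 1 times, stopping early once nothing changes.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
                changed = True
        if not changed:
            break

    # One more pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist, parent


# Assumed edge list (src, dest, weight), consistent with the tables above.
edges = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]
dist, parent = bellman_ford(8, edges, source=0)
for vertex in range(8):
    print(vertex, dist[vertex], parent[vertex])
```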
Graph Algorithms and Measures - (4) Centrality
In-Degree, Out-Degree
● The node in-degree is the number of edges pointing in to the node
● The node out-degree is the number of edges pointing out of the node
[ Figure: directed weighted graph with vertices 0 to 7 ]
[ The in-out degree for each node (example) ]
ID | In-degree | Out-degree
0 | 2 | 3
1 | 1 | 2
2 | 2 | 3
3 | 2 | 1
4 | 1 | 1
5 | 1 | 1
6 | 2 | 1
7 | 1 | 0
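Counting degrees is a single pass over the edge list. The edges below are an assumption, reconstructed to agree with the degree table (the figure is not available).

```python
from collections import Counter

# Assumed (src, dest) pairs for the example graph with vertices 0 to 7.
edges = [
    (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3),
    (2, 5), (2, 6), (3, 0), (4, 0), (5, 6), (6, 7),
]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dest for _, dest in edges)

# Counter returns 0 for vertices that never appear, e.g. out-degree of 7.
table = [(v, in_degree[v], out_degree[v]) for v in range(8)]
for row in table:
    print(row)
```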
Graph Algorithms and Measures - (4) Centrality
PageRank (1 / 2)
● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites
* source: Wikipedia
[ Figure: the size of each face is proportional to the total size of the other faces which are pointing to it ]
- PR(A): PageRank of node A
- N: the total number of Nodes
- L(B): the number of Links from node B
- d: damping factor (probability, at any step, that a surfer will continue randomly clicking on links)
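The symbols PR(A), N, L(B) and d fit the standard PageRank update, which is also the form used in the worked example on the next slide (for instance PR(A) = (1-0.85)/4 + 0.85*(...)). Written out, with M(A) as an assumed name for the set of nodes that link to A:

```latex
PR(A) = \frac{1-d}{N} + d \sum_{B \in M(A)} \frac{PR(B)}{L(B)}
```

Each linking node B passes on its own score divided by its number of outgoing links, and the first term gives every node a small baseline share.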
Graph Algorithms and Measures - (4) Centrality
PageRank (2 / 2)
[ Figure: example graph with four nodes A, B, C, D ]
1st round (initial values)
PR(A) = 0.25
PR(B) = 0.25
PR(C) = 0.25
PR(D) = 0.25
2nd round
PR(A) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/1 + 0.25/3) = 0.427
PR(B) = (1-0.85)/4 + 0.85*(0.25/3) = 0.108
PR(C) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/3) = 0.214
PR(D) = (1-0.85)/4 + 0.85*0 = 0.037
...
Final round (recursive calculation → converged)
PR(A) = 0.127
PR(B) = 0.048
PR(C) = 0.069
PR(D) = 0.038
* PR(A): PageRank of node A, N: the total number of Nodes, L(B): the number of Links from node B, d: damping factor (typically 0.85)
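The rounds above can be reproduced with a short loop. The link structure is an assumption inferred from the arithmetic (A receives from nodes with 2, 1 and 3 out-links, and so on), and the update deliberately follows the slide's formula without any special handling of node A, which has no outgoing links; that is why the converged values do not sum to 1.

```python
def pagerank_rounds(links, d=0.85, rounds=1):
    """Apply the slide's update `rounds` times, starting from 1/N per node."""
    nodes = sorted(links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(rounds):
        nxt = {}
        for v in nodes:
            # Sum PR(B) / L(B) over every node B that links to v.
            incoming = sum(pr[u] / len(links[u]) for u in nodes if v in links[u])
            nxt[v] = (1 - d) / n + d * incoming
        pr = nxt
    return pr


# Assumed out-links per node, inferred from the worked example.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

second = pagerank_rounds(links, rounds=1)   # the slide's "2nd round"
final = pagerank_rounds(links, rounds=50)   # effectively converged
print({k: round(v, 3) for k, v in second.items()})
print({k: round(v, 3) for k, v in final.items()})
```

One update from the uniform start corresponds to the slide's 2nd round, since the slide counts the initial values as the 1st round.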
Graph Algorithms and Measures - (4) Centrality
Closeness
● Closeness of a node is a measure of centrality in a network, calculated as the inverse of the sum of the lengths of the shortest paths between the node and all other nodes in the graph (i.e., based on All Pairs Shortest Path). For node 0 below: 1/(1+1+2+10+2+3+4) = 1/23 ≈ 0.043
[ Figure: directed weighted graph with vertices 0 to 7 ]
[ All Pairs Shortest Path ]
source | destination | weight
0 | 0 | 0
0 | 1 | 1
0 | 2 | 1
0 | 3 | 2
0 | 4 | 10
0 | 5 | 2
0 | 6 | 3
0 | 7 | 4
1 | 0 | 4
1 | 1 | 0
1 | 2 | 2
1 | 3 | 3
1 | 4 | 14
1 | 5 | 3
1 | 6 | 4
1 | 7 | 5
2 | 0 | 2
...
[ Closeness Centrality ]
src_id | closeness
0 | 0.043
1 | 0.028
2 | 0.041
3 | 0.035
* N: The number of nodes, d(y, x): The distance between nodes y and x
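A sketch of the closeness calculation: Floyd-Warshall for all pairs shortest paths, then the inverse of the summed distances for each source. The edge list is an assumption reconstructed to match the tables above, and only sources 0 to 3 are evaluated, as on the slide.

```python
def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall: dist[i][j] is the shortest path weight from i to j."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def closeness(dist, src):
    """Inverse of the summed distances from src to every reachable other node."""
    inf = float("inf")
    total = sum(d for j, d in enumerate(dist[src]) if j != src and d != inf)
    return 1.0 / total if total else 0.0


# Assumed edge list (src, dest, weight), consistent with the tables above.
edges = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2),
    (1, 3, 10), (2, 3, 1), (2, 5, 1), (2, 6, 3),
    (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]
dist = all_pairs_shortest_paths(8, edges)
scores = {src: closeness(dist, src) for src in range(4)}
for src, score in scores.items():
    print(src, score)
```

The computed values agree with the slide's table once cut to three decimals (1/23, 1/35, 1/24 and 1/28).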
Big Issue of Graph Algorithms - High Complexity
* image source: https://www.xkcd.com/399/
Big Issue of Graph Algorithms - High Complexity
Type | Algorithms/ Measures | Time Complexity
Group | Weakly-Connected Component | O(|V| + |E|)
Structure | Density | O(|V| + |E| log|E|)
Structure | Diameter | O(|V|^3)
Path | All Pairs Shortest Path | O(|V|^3)
Path | Single Source Shortest Path | O(|V|^2)
Path | Breadth-First Search | O(|V| + |E|)
Centrality | In-Degree, Out-Degree | O(|V| + |E|)
Centrality | Closeness Centrality | O(|V|^3)
Centrality | PageRank | O(log(network size)/(1 - damping factor))
Centrality | Betweenness Centrality | O(|V|^2 log|V| + |V||E|)
* |V|: the number of Vertices in graph
* |E|: the number of Edges in graph
● Computationally intensive: cost grows polynomially (up to cubically) with the number of vertices and edges
⇒ Graph Analysis at Scale: parallel processing with MADlib on Greenplum
Agenda
1. Why Graph Analytics?
2. What is Graph Analytics?
3. Graph Analytics w/ MADlib
Graph Analytics: Why, Where to use, What, How
Tools for Graph Analytics
● Graph Analytics at Scale with Open Source MADlib on Greenplum
[ Figure: tools positioned by Data Size & Processing (Small Data vs. Big Data/ Parallel Processing) and by whether they are OSS or not (Commercial vs. Open Source) ]
Tools named in the figure: sna, igraph, ergm, network; NetworkX, graph-tool, SNAP, pygraphviz; ...
Other groups in the figure: Interactive visualization focused, Graph DB
Graph Analytics with MADlib on Greenplum
Analytics Platform, GPDB
❏ Advanced Analytics In Database: Extended Language, GPText
❏ Scale Out
❏ MPP (Massively Parallel Processing) Architecture
REGRESSION, CLASSIFICATION, CLUSTERING, GEOSPATIAL, GRAPH, TEXT, IMAGE
Graph Analytics at Scale
● Designed for very large graphs (billions of vertices/edges)
● No need to move and transform data for an external graph engine
- One analytics database to deploy and manage
● Familiar SQL interface
● Combine context-based graph analytics with other content-based techniques
Apache MADlib: Big Data Machine Learning in SQL
Apache MADlib: Scalable, In-Database Machine Learning
● Open source https://github.com/apache/madlib
● Downloads and docs http://madlib.apache.org/
● Wiki https://cwiki.apache.org/confluence/display/MADLIB/
● Open source, top level Apache project
● For PostgreSQL and Greenplum Database
● Powerful machine learning, graph, statistics and analytics for data scientists
Apache MADlib: Functions (April 2018)
Data Types and Transformations
  Array and Matrix Operations
  Matrix Factorization: Low Rank, Singular Value Decomposition (SVD)
  Norms and Distance Functions
  Sparse Vectors
  Encoding Categorical Variables
  Path Functions
  Pivot
  Sessionize
  Stemming
Graph
  All Pairs Shortest Path (APSP)
  Breadth-First Search
  Hyperlink-Induced Topic Search (HITS)
  Average Path Length
  Closeness Centrality
  Graph Diameter
  In-Out Degree
  PageRank and Personalized PageRank
  Single Source Shortest Path (SSSP)
  Weakly Connected Components
Model Selection
  Cross Validation
  Prediction Metrics
  Train-Test Split
Statistics
  Descriptive Statistics: Cardinality Estimators, Correlation and Covariance, Summary
  Inferential Statistics: Hypothesis Tests
  Probability Functions
Supervised Learning
  Neural Networks
  Support Vector Machines (SVM)
  Conditional Random Field (CRF)
  Regression Models: Clustered Variance, Cox-Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Naïve Bayes, Ordinal Regression, Robust Variance
  Tree Methods: Decision Tree, Random Forest
Time Series Analysis
  ARIMA
Unsupervised Learning
  Association Rules (Apriori)
  Clustering (k-Means)
  Principal Component Analysis (PCA)
  Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
  Columns to Vector
  Conjugate Gradient
  Linear Solvers: Dense Linear Systems, Sparse Linear Systems
  Mini-Batching
  PMML Export
  Term Frequency for Text
  Vector to Columns
Nearest Neighbors
  k-Nearest Neighbors
Sampling
  Balanced/ Random/ Stratified Sampling
Apache MADlib: Graph Representation in MADlib
Vertex Table
Vertex | Vertex Params
0 | ...
1 | ...
2 | ...
3 | ...
... | ...
Edge Table
Source Vertex | Dest Vertex | Edge Weight | Edge Params
0 | 3 | 1.0 | ...
1 | 0 | 5.0 | ...
1 | 2 | 3.0 | ...
2 | 3 | 8.0 | ...
3 | 0 | 3.0 | ...
3 | 1 | 2.0 | ...
... | ... | ... | ...
[ Directed Graph (example) ]: vertices 0 to 3 joined by the six weighted edges listed in the Edge Table
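To make the two-table representation concrete, here is a small Python sketch that turns the example rows into an adjacency map. The variable names are illustrative; MADlib itself works on the tables directly through SQL.

```python
# Rows copied from the example Vertex Table and Edge Table.
vertex_table = [0, 1, 2, 3]
edge_table = [
    (0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0),
    (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0),
]

# adjacency[src][dest] = weight of the directed edge src -> dest.
adjacency = {v: {} for v in vertex_table}
for src, dest, weight in edge_table:
    adjacency[src][dest] = weight

for src, targets in adjacency.items():
    print(src, targets)
```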
example : PageRank in MADlib
● Create vertex and edge tables to represent the graph
* http://madlib.apache.org/docs/latest/group__grp__pagerank.html
DROP TABLE IF EXISTS vertex;
CREATE TABLE vertex(
id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6);
DROP TABLE IF EXISTS edge;
CREATE TABLE edge(
src INTEGER,
dest INTEGER,
user_id INTEGER
)
DISTRIBUTED BY (user_id);
INSERT INTO edge VALUES
(0, 1, 1), (0, 2, 1), -- user id 1
(0, 4, 1), (1, 2, 1),
(1, 3, 1), (2, 3, 1),
(2, 5, 1), (2, 6, 1),
(3, 0, 1), (4, 0, 1),
(5, 6, 1), (6, 3, 1),
(0, 1, 2), (0, 2, 2), -- user id 2
(0, 4, 2), (1, 2, 2),
(1, 3, 2), (2, 3, 2),
(3, 0, 2), (4, 0, 2),
(5, 6, 2), (6, 3, 2);
example : PageRank in MADlib
● Compute the PageRank with All IDs
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_out' -- Output table of PageRank
, NULL); -- Damping factor (default 0.85)
SELECT * FROM pagerank_out
ORDER BY pagerank DESC;
example : PageRank in MADlib
● Network Diagram by Graphviz and PyGraphviz
* PyGraphviz is a Python interface to the Graphviz graph layout and visualization package
A vertex with a high PageRank is usually considered more "important" or more "influential" or more "relevant" than a vertex with a low PageRank.
[ Figure: network diagrams for User ID 1 and User ID 2; each node is labeled ID (PageRank), and the size of a node is proportional to its PageRank value ]
example : PageRank in MADlib
● PageRank of vertices associated with each user by the grouping feature
DROP TABLE IF EXISTS pagerank_gr_out, pagerank_gr_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_gr_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, 0.00000001 -- Threshold
, 'user_id' -- Grouping column name
);
SELECT * FROM pagerank_gr_out
ORDER BY user_id, pagerank DESC;
[ Output: PageRank of user id 1, PageRank of user id 2 ]
example : PageRank in MADlib
● Personalized PageRank of vertices {2, 4}, for Recommendations
DROP TABLE IF EXISTS pagerank_pers_out,
pagerank_pers_out_summary;
SELECT madlib.pagerank(
'vertex' -- Vertex table
, 'id' -- Vertex id column
, 'edge' -- Edge table
, 'src=src, dest=dest' -- Comma delimited string of edge arguments
, 'pagerank_pers_out' -- Output table of PageRank
, NULL -- Default damping factor (0.85)
, NULL -- Default max iterations (100)
, NULL -- Default threshold: 1/(number of vertices * 1000)
, NULL -- No Grouping
, '{2, 4}' -- Personalization vertices
);
SELECT * FROM pagerank_pers_out
ORDER BY pagerank DESC;
SELECT * FROM
pagerank_pers_out_summary;
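* Personalized PageRank = (1-p)*Ax + p*E, where ‘A’ is the link (transition) matrix, ‘x’ is the PageRank vector, ‘E’ is the personalization vector built from the list of vertices, and ‘p’ is the restart probability (1 minus the damping factor)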
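A plain-Python sketch of personalized PageRank on the user id 1 edges from the example, with restarts going only to vertices 2 and 4. It uses d for the damping factor as defined on the earlier PageRank slide, so each step is rank = d * (propagated rank) + (1 - d) * (restart vector). This is not MADlib's implementation, and its numbers need not match MADlib's output exactly.

```python
def personalized_pagerank(vertices, edges, seeds, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration where the random surfer restarts only at the seed vertices."""
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    # Restart mass is split evenly over the seeds and is zero elsewhere.
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in vertices}
    rank = dict(restart)
    for _ in range(max_iter):
        nxt = {v: (1 - d) * restart[v] for v in vertices}
        for src, dest in edges:
            nxt[dest] += d * rank[src] / out_degree[src]
        delta = max(abs(nxt[v] - rank[v]) for v in vertices)
        rank = nxt
        if delta < tol:
            break
    return rank


# Edges for user id 1 from the example edge table (every vertex has an out-link).
vertices = range(7)
edges = [
    (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3),
    (2, 5), (2, 6), (3, 0), (4, 0), (5, 6), (6, 3),
]
rank = personalized_pagerank(vertices, edges, seeds={2, 4})
for v in sorted(rank, key=rank.get, reverse=True):
    print(v, round(rank[v], 4))
```

Because every vertex here has at least one outgoing edge, the scores keep summing to 1; a graph with dead-end vertices would need an extra rule for their lost rank.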
example : PageRank in MADlib
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
[ Figure: PageRank Performance on Greenplum w/ MADlib. Log-log plot; horizontal axis ticks 1K, 10K, 100K, 1M, 10M, 100M; vertical axis ticks 1s, 100s, 10K s, 1M s; the largest case is marked 5B edges ]
In Summary
● Capture the Relationship in Networks using Graph Analytics
→ Community, Structure, Path, Centrality
→ Combine context-based graph analytics with other content-based insights
● Graph analytics at SCALE with Open Source Software
→ Apache MADlib on Greenplum, massively parallel processing
One more thing...
GREENPLUM SUMMIT at PostgresConf 2019
by Pivotal
Transforming How The World Builds Software


Graph Analytics with Greenplum and Apache MADlib

  • 1. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Graph Analytics with Greenplum and Apache MADlib Pivotal Korea HongDon Lee, Sr. Data Scientist 30th January 2019
  • 2. Agenda 1. Why Graph Analytics? 2. What is Graph Analytics? 3. Graph Analytics w/ MADlib Graph Analytics Why Where to use What How
  • 3. Everything is connected! “Nothing ever exists entirely alone. Everything is in relation to everything else (緣起)” “Learn how to see: Everything is connected to everything else” “In nature we never see anything isolated, but everything in connection with something else which is before it, beside it, under it and over it” Buddha Leonardo da vinci Goethe 3
  • 4. What a Small World! “6 Degrees of Separation” 1973, Stanley Milgram, Small-world experiment 1 2 3 4 5 6 4
  • 5. From Reductionism to Holism Reductionism Holism “Divide and Conquer” vs. “Everything has to be understood in relation to the whole” 5
  • 6. From Individual to Relation Time Features 2019.01.01 2019.01.02 2019.01.03 2019.01.04 2019.01.05 2019.01.06 2019.01.30 ... Cross-sectional Perspective Longitudinal Perspective At the individual levelDemographics Behaviors Preferences Economic Status Education Background ... W ho are you? 6
  • 7. From Individual to Relation Time Features Demographics Behaviors Preferences Economic Status Education Background 2019.01.01 ... 2019.01.02 2019.01.03 2019.01.04 2019.01.05 2019.01.06 2019.01.30 ... Cross-sectional Perspective Longitudinal Perspective Relation/ Connection Family Friends Colleagues Community ... “Tell me who your friends are and I’ll tell you who you are” - Mexican Proverb - 7
  • 8. Graph Analytics, one of the Data Scientist’s knifes Graph Analytics t-Test, ANOVA CNN, RNN, GAN Random Forest, XGBoost Bayesian Statistics Regression, Logistic Regression PCA, factor analysis Clustering Text Analysis, NLP Depends on business problem and data 8
  • 9. Network: Everywhere with Everything, All the time MMO Role-Playing Game * www.researchgate.net Chemistry * https://www.nature.com/articles/ Social Network Epidemiology * http://www.netminer.com/community* Grandjean, M. (2016) Bank Risk * https://cambridge-intelligence.com 1st Party Fraud Manufacturing * www.infoglide.com * https://blog.trifinance.com* www.researchgate.net Gene 9
  • 10. Use Cases - PageRank ● Measures the importance of a vertex in a graph by counting the number and quality of the links to that vertex ❏ Web Search ❏ Scientific impact of researchers ❏ Neuroscience ❏ Street and space usage * Image from https://en.wikipedia.org/wiki/PageRank 10
  • 11. Use Cases - Single Source Shortest Path ● Find a path to every vertex so that the sum of the weights of its constituent edges is minimized ❏ Vehicle routing/ navigation ❏ Degrees of separation in a social network ❏ Mid-delay path in a telecommunications network ❏ Plant and facility layout ❏ VLSI (Very-Large-Scale Integration) design 11
  • 12. Use Cases - Cyber-security by Graph model ● Using historical window events data to build historical graphs* of typical user behavior • Which machines does the user log in to? • Which machines does the user log in from? • How often? • In which order? ● Is this behavior typical? • Is it typical for this user? • Is it typical for someone in a particular department? • Is this typical for someone in the user’s job role? ● Graph models are sensitive to direction, order, and frequency. 34.23.123.4 Typical Behavior Anomalous Behavior DB with financial information 34.23.123.51 34.23.1.1 34.23.0.1 34.23.2.8 34.23.123.4 34.23.1.1 34.23.0.1 34.23.2.8 34.23.123.51 *Reference: Alexander D. Kenta, Lorie M. Liebrock, Joshua C. Neila. Authentication graphs: Analyzing user behavior within an enterprise network. vs. 12
  • 13. Use Cases - Connected Component ● Calculate the Jaccard Dissimilarity Scores for each pair of materials ● If material X and Y are potential duplicates and material Y and Z are potential duplicates then X, Y, Z is a connected component in the graph of all materials and form a cluster ⇒ Connected component analysis resulted in 10% of materials identified as potential duplicates based on their bill of material attributes Z X Y Z Features for each material • part type • material type & group • product line & family • revision key • weld, material & coating specs • quality matrix • unit of measurement • Weight : 13
  • 14. Agenda 1. Why Graph Analytics? 2. What is Graph Analytics? 3. Graph Analytics w/ MADlib Graph Analytics Why Where to use What How
  • 15. The Origin of Graph Theory ● Seven bridges of Konigsberg problem ● Leonhard Euler, a mathematician, proved that the problem has no solution “The problem was to devise a walk through the city that would cross each of those bridges once and only once.” Euler, 1753 15
  • 16. [ Terminology of Graph theory ] What is Graph Theory? ● Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. 0 1 2 4 3 5 6 7 1 2 10 1 10 1 1 3 1 -2 1 1 vertice edge weight ● Vertex ● Node ● Point ● Actor V ● Edge ● Link ● Arc ● Line Directed Undirected [ Directed Network Graph with Weight (example) ] 10 Weight E 16
  • 17. Graph Algorithms and Measures Group Structure Centrality Types Question Feasures Path “What are the sub-graphs, component, communities?” “What is the character of the network structure?” “What is the most important vertices within a graph” “What is the shortest path(distance) among vertices” weakly-connected component Density, Diameter, Average path length, Modularity Degree (in/out, weight), Closeness, PageRank, Hub, Authority, Betweenness, Clustering coefficient Single source shortest path, All pairs shortest path, Breadth-First Search Graph-based Features 1 2 3 4 17
  • 18. Graph Algorithms and Measures - (1) Group Group Structure Centrality Path Weakly-Connected Component ● A Connected Component (or just Component) of an undirected graph is subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph * source: Wikipedia [ A supergraph with three connected components ] Component 1 Component 2 Component 3 Supergraph 18
  • 19. D = |E| |V| (|V| - 1) / 2 Graph Algorithms and Measures - (2) Structure Group Structure Centrality Path Density ● A dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague, and depends on the context. * source: Wikipedia ❏ For Undirected simple graphs ❏ For Directed simple graphs D = |E| |V| (|V| - 1) E : the number of Edges, V : the number of Vertices [ Density by components (example) ] D= |6| |4|(|4|-1)/2 D = |3| |4|(|4|-1)/2 =1 =0.5 19
  • 20. Graph Algorithms and Measures - (3) Path: Single Source Shortest Path (SSSP) ● Given a graph and a source vertex, the Single Source Shortest Path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the weights of the path is minimized.
      [ Shortest paths from vertex '0' (example) ]
      ID   weight       parent
      0    0            0
      1    1            0
      2    1            0
      3    2 (= 1+1)    2
      4    10           0
      5    2            2
      6    3            5
      7    4            6
      * weight : The total weight of the shortest path from the source vertex to this particular vertex.
      * parent : The parent of this vertex in the shortest path from source.
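The table above can be reproduced with a short Bellman-Ford sketch, chosen here because the example contains a negative weight. The edge list is an assumption: the slide's figure did not survive extraction, so the edges below were reconstructed to be consistent with the weights and tables shown in this deck. The function name is illustrative, and this is not MADlib's implementation.

```python
import math

# Assumed (src, dest, weight) edges, reconstructed from the slide's tables.
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2), (1, 3, 10), (2, 3, 1),
    (2, 5, 1), (2, 6, 3), (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def bellman_ford(num_vertices, edges, source):
    """Return (dist, parent) lists for shortest paths from `source`."""
    dist = [math.inf] * num_vertices
    parent = [None] * num_vertices
    dist[source] = 0
    parent[source] = source  # the slide lists the source as its own parent

    # Relax every edge up to |V| - 1 times, stopping early if nothing changes.
    for _ in range(num_vertices - 1):
        changed = False
        for src, dest, weight in edges:
            if dist[src] + weight < dist[dest]:
                dist[dest] = dist[src] + weight
                parent[dest] = src
                changed = True
        if not changed:
            break

    # One more pass: any further improvement means a negative-weight cycle.
    for src, dest, weight in edges:
        if dist[src] + weight < dist[dest]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, parent


dist, parent = bellman_ford(8, EDGES, source=0)
for vertex_id, (weight, par) in enumerate(zip(dist, parent)):
    print(vertex_id, weight, par)
```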
  • 21. Graph Algorithms and Measures - (4) Centrality: In-Degree, Out-Degree ● The node in-degree is the number of edges pointing in to the node ● The node out-degree is the number of edges pointing out of the node
      [ The in-out degree for each node (example) ]
      ID   In-degree   Out-degree
      0    2           3
      1    1           2
      2    2           3
      3    2           1
      4    1           1
      5    1           1
      6    2           1
      7    1           0
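Counting degrees needs only one pass over the edge list. In the sketch below, the edge list is an assumption reconstructed to agree with the degree table above (the figure itself was lost), and `in_out_degrees` is an illustrative name.

```python
from collections import Counter

# Assumed (src, dest) edges, reconstructed from the slide's tables.
EDGES = [
    (0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3),
    (2, 5), (2, 6), (3, 0), (4, 0), (5, 6), (6, 7),
]


def in_out_degrees(vertices, edges):
    """Map each vertex to an (in_degree, out_degree) pair."""
    in_degree = Counter(dest for _, dest in edges)
    out_degree = Counter(src for src, _ in edges)
    # Counter returns 0 for vertices that never appear on that side.
    return {v: (in_degree[v], out_degree[v]) for v in vertices}


degrees = in_out_degrees(range(8), EDGES)
for vertex_id, (n_in, n_out) in degrees.items():
    print(vertex_id, n_in, n_out)
```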
  • 22. Graph Algorithms and Measures - (4) Centrality: PageRank (1 / 2) ● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. * source: Wikipedia ● $PR(A) = \frac{1-d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \cdots \right)$, summed over the nodes B, C, ... that link to A - PR(A): PageRank of node A - N: the total number of Nodes - L(B): the number of Links from node B - d: damping factor (probability, at any step, that a surfer will continue randomly clicking on links) [ Figure: The size of each face is proportional to the total size of the other faces which are pointing to it. ]
  • 23. Graph Algorithms and Measures - (4) Centrality: PageRank (2 / 2)
      [ Example with four nodes A, B, C, D ]
      1st round (initial values): PR(A) = 0.25, PR(B) = 0.25, PR(C) = 0.25, PR(D) = 0.25
      2nd round:
        PR(A) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/1 + 0.25/3) = 0.427
        PR(B) = (1-0.85)/4 + 0.85*(0.25/3) = 0.108
        PR(C) = (1-0.85)/4 + 0.85*(0.25/2 + 0.25/3) = 0.214
        PR(D) = (1-0.85)/4 + 0.85*0 = 0.037
      ... Recursive calculation → converged
      Final round: PR(A) = 0.127, PR(B) = 0.048, PR(C) = 0.069, PR(D) = 0.038
      * PR(A): PageRank of node A, N: the total number of Nodes, L(B): the number of Links from node B, d: damping factor (typically 0.85)
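The rounds above can be replayed with a short iteration. The link structure is an assumption inferred from the divisors in the 2nd-round arithmetic (B links to A and C, C links to A, D links to A, B and C, and A has no outgoing links); the figure itself was lost. The function name is illustrative and this is not MADlib's implementation.

```python
def pagerank_rounds(out_links, d=0.85, tol=1e-10, max_iter=100):
    """Iterate PR(v) = (1-d)/N + d * sum(PR(u)/L(u)) over nodes u linking to v.

    Returns the final ranks and the list of ranks after every round.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    history = [dict(rank)]
    for _ in range(max_iter):
        new_rank = {v: (1 - d) / n for v in nodes}
        for src, targets in out_links.items():
            # A node without outgoing links passes nothing on, as on the slide.
            for target in targets:
                new_rank[target] += d * rank[src] / len(targets)
        history.append(new_rank)
        delta = max(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank, history


# Assumed link structure, inferred from the slide's arithmetic.
OUT_LINKS = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

final_rank, history = pagerank_rounds(OUT_LINKS)
second_round = history[1]
print({v: round(r, 4) for v, r in second_round.items()})
print({v: round(r, 4) for v, r in final_rank.items()})
```

Because A has no outgoing links in this assumed graph, the ranks do not sum to 1; the sketch keeps that behavior so that it matches the slide's numbers.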
  • 24. Graph Algorithms and Measures - (4) Centrality: Closeness ● Closeness of a node is a measure of centrality in a network, calculated from the sum of the lengths of the shortest paths between the node and all other nodes in the graph (i.e., based on All Pairs Shortest Path). The values tabulated below are the reciprocal of that sum; for vertex 0, 1 / (0+1+1+2+10+2+3+4) = 1/23 ≈ 0.043.
      [ All Pairs Shortest Path ]
      source   destination   weight
      0        0             0
      0        1             1
      0        2             1
      0        3             2
      0        4             10
      0        5             2
      0        6             3
      0        7             4
      1        0             4
      1        1             0
      1        2             2
      1        3             3
      1        4             14
      1        5             3
      1        6             4
      1        7             5
      2        0             2
      ...
      [ Closeness Centrality ]
      src_id   closeness
      0        0.043
      1        0.028
      2        0.041
      3        0.035
      * N: The number of nodes, * d(y, x) : The distance between y and x node
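A compact way to see how the two tables relate is to compute all-pairs distances with Floyd-Warshall and then take the reciprocal of each row sum. The edge list is an assumption reconstructed to agree with the tables in this deck, the function names are illustrative, and MADlib's own closeness function may define or normalize the measure differently.

```python
import math

# Assumed (src, dest, weight) edges, reconstructed from the slide's tables.
EDGES = [
    (0, 1, 1), (0, 2, 1), (0, 4, 10), (1, 2, 2), (1, 3, 10), (2, 3, 1),
    (2, 5, 1), (2, 6, 3), (3, 0, 1), (4, 0, -2), (5, 6, 1), (6, 7, 1),
]


def all_pairs_shortest_paths(num_vertices, edges):
    """Floyd-Warshall: dist[i][j] is the shortest distance from i to j."""
    dist = [[math.inf] * num_vertices for _ in range(num_vertices)]
    for v in range(num_vertices):
        dist[v][v] = 0
    for src, dest, weight in edges:
        dist[src][dest] = min(dist[src][dest], weight)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def inverse_distance_closeness(dist):
    """Reciprocal of the summed distances to all reachable vertices."""
    result = {}
    for vertex, row in enumerate(dist):
        total = sum(d for d in row if d != math.inf)
        # Undefined when nothing is reachable or the sum is not positive.
        result[vertex] = 1 / total if total > 0 else None
    return result


dist = all_pairs_shortest_paths(8, EDGES)
closeness = inverse_distance_closeness(dist)
for vertex in range(4):
    print(vertex, closeness[vertex])
```

The triple loop is the cubic cost that makes this family of measures expensive on large graphs.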
  • 25. Big Issue of Graph Algorithms - High Complexity * image source: https://www.xkcd.com/399/
  • 26. Big Issue of Graph Algorithms - High Complexity
      Type         Algorithm / Measure            Time Complexity
      Group        Weakly-Connected Component     O(|V| + |E|)
      Structure    Density                        O(|V| + |E| log|E|)
                   Diameter                       O(|V|^3)
      Path         All Pairs Shortest Path        O(|V|^3)
                   Single Source Shortest Path    O(|V|^2)
                   Breadth-First Search           O(|V| + |E|)
      Centrality   In-Degree, Out-Degree          O(|V| + |E|)
                   Closeness Centrality           O(|V|^3)
                   PageRank                       O(log(network size) / (1 - damping factor))
                   Betweenness Centrality         O(|V|^2 log|V| + |V||E|)
      * |V|: the number of Vertices in graph * |E|: the number of Edges in graph
      ● Computationally intensive: the cost of several measures grows with the cube of the number of vertices, so running time rises steeply as the graph grows
      Graph Analysis at Scale, parallel processing with MADlib on Greenplum
  • 27. Agenda 1. Why Graph Analytics? 2. What is Graph Analytics? 3. Graph Analytics w/ MADlib
  • 28. Tools for Graph Analytics ● Graph Analytics at Scale with Open Source MADlib on Greenplum [ Figure: tools placed on two axes, "Data Size & Processing" (Small Data vs. Big Data / Parallel Processing) and "Whether OSS or not" (Commercial vs. Open Source). Labeled entries: sna, igraph, ergm, network; NetworkX, graph-tool, SNAP, pygraphviz ...; Interactive visualization focused; Graph DB ]
  • 29. Analytics Platform, GPDB: Graph Analytics at Scale ● Designed for very large graphs (billions of vertices/edges) ● No need to move and transform data for an external graph engine - One analytics database to deploy and manage ● Familiar SQL interface ● Combine context-based graph analytics with other content-based techniques ❏ Advanced Analytics In Database (Extended Language, GPText): REGRESSION, CLASSIFICATION, CLUSTERING, GEOSPATIAL, GRAPH, TEXT, IMAGE ❏ Scale Out ❏ MPP (Massively Parallel Processing) Architecture [ Figure: Graph Analytics with MADlib on Greenplum ]
  • 30. Apache MADlib: Scalable, In-Database Machine Learning ● Open source https://github.com/apache/madlib ● Downloads and docs http://madlib.apache.org/ ● Wiki https://cwiki.apache.org/confluence/display/MADLIB/ ● Apache MADlib: Big Data Machine Learning in SQL ● Open source, top level Apache project ● For PostgreSQL and Greenplum Database ● Powerful machine learning, graph, statistics and analytics for data scientists
  • 31. Apache MADlib: Functions (April 2018)
      Data Types and Transformations: Array and Matrix Operations; Matrix Factorization (Low Rank, Singular Value Decomposition (SVD)); Norms and Distance Functions; Sparse Vectors; Encoding Categorical Variables; Path Functions; Pivot; Sessionize; Stemming
      Graph: All Pairs Shortest Path (APSP); Breadth-First Search; Hyperlink-Induced Topic Search (HITS); Average Path Length; Closeness Centrality; Graph Diameter; In-Out Degree; PageRank and Personalized PageRank; Single Source Shortest Path (SSSP); Weakly Connected Components
      Model Selection: Cross Validation; Prediction Metrics; Train-Test Split
      Statistics: Descriptive Statistics (Cardinality Estimators, Correlation and Covariance, Summary); Inferential Statistics (Hypothesis Tests); Probability Functions
      Supervised Learning: Neural Networks; Support Vector Machines (SVM); Conditional Random Field (CRF); Regression Models (Clustered Variance, Cox-Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Naïve Bayes, Ordinal Regression, Robust Variance); Tree Methods (Decision Tree, Random Forest)
      Time Series Analysis: ARIMA
      Unsupervised Learning: Association Rules (Apriori); Clustering (k-Means); Principal Component Analysis (PCA); Topic Modelling (Latent Dirichlet Allocation)
      Utility Functions: Columns to Vector; Conjugate Gradient; Linear Solvers (Dense Linear Systems, Sparse Linear Systems); Mini-Batching; PMML Export; Term Frequency for Text; Vector to Columns
      Nearest Neighbors: k-Nearest Neighbors
      Sampling: Balanced / Random / Stratified Sampling
  • 32. Apache MADlib: Graph Representation in MADlib
      [ Vertex Table ]
      Vertex   Vertex Params
      0        ...
      1        ...
      2        ...
      3        ...
      ...      ...
      [ Edge Table ]
      Source Vertex   Dest Vertex   Edge Weight   Edge Params
      0               3             1.0           ...
      1               0             5.0           ...
      1               2             3.0           ...
      2               3             8.0           ...
      3               0             3.0           ...
      3               1             2.0           ...
      ...             ...           ...           ...
      [ Figure: Directed Graph (example) with vertices 0 to 3 and the edge weights listed above ]
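To make the two-table representation concrete, the sketch below loads the example rows into an adjacency list, the structure most in-memory graph code works with. The variable and function names are illustrative assumptions; MADlib itself operates on the tables directly in SQL.

```python
# Rows copied from the example vertex and edge tables.
VERTEX_TABLE = [0, 1, 2, 3]
EDGE_TABLE = [  # (source vertex, dest vertex, edge weight)
    (0, 3, 1.0), (1, 0, 5.0), (1, 2, 3.0),
    (2, 3, 8.0), (3, 0, 3.0), (3, 1, 2.0),
]


def build_adjacency(vertex_ids, edge_rows):
    """Map each vertex to its outgoing (dest, weight) pairs."""
    adjacency = {v: [] for v in vertex_ids}
    for src, dest, weight in edge_rows:
        adjacency[src].append((dest, weight))
    return adjacency


adjacency = build_adjacency(VERTEX_TABLE, EDGE_TABLE)
for vertex, neighbors in adjacency.items():
    print(vertex, neighbors)
```

Keeping vertices in their own table means a vertex with no edges is still part of the graph, which matters for measures such as PageRank that depend on the total vertex count.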
  • 33. Example: PageRank in MADlib ● Create vertex and edge tables to represent the graph * http://madlib.apache.org/docs/latest/group__grp__pagerank.html

      DROP TABLE IF EXISTS vertex;
      CREATE TABLE vertex(
          id INTEGER
      );
      INSERT INTO vertex VALUES
      (0), (1), (2), (3), (4), (5), (6);

      DROP TABLE IF EXISTS edge;
      CREATE TABLE edge(
          src INTEGER,
          dest INTEGER,
          user_id INTEGER
      ) DISTRIBUTED BY (user_id);
      INSERT INTO edge VALUES
      -- user id 1
      (0, 1, 1), (0, 2, 1), (0, 4, 1), (1, 2, 1), (1, 3, 1), (2, 3, 1),
      (2, 5, 1), (2, 6, 1), (3, 0, 1), (4, 0, 1), (5, 6, 1), (6, 3, 1),
      -- user id 2
      (0, 1, 2), (0, 2, 2), (0, 4, 2), (1, 2, 2), (1, 3, 2), (2, 3, 2),
      (3, 0, 2), (4, 0, 2), (5, 6, 2), (6, 3, 2);
  • 34. Example: PageRank in MADlib ● Compute the PageRank with all IDs

      DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_out'        -- Output table of PageRank
          , NULL                  -- Damping factor (default 0.85)
      );
      SELECT * FROM pagerank_out ORDER BY pagerank DESC;
  • 35. Example: PageRank in MADlib ● Network Diagram by Graphviz and PyGraphviz * PyGraphviz is a Python interface to the Graphviz graph layout and visualization package ● A vertex with a high PageRank is usually considered more "important", more "influential", or more "relevant" than a vertex with a low PageRank. [ Figure: network diagrams for User ID 1 and User ID 2; each node is labeled ID (PageRank), and the size of a node is proportional to its PageRank value ]
  • 36. Example: PageRank in MADlib ● PageRank of vertices associated with each user, using the grouping feature

      DROP TABLE IF EXISTS pagerank_gr_out, pagerank_gr_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_gr_out'     -- Output table of PageRank
          , NULL                  -- Default damping factor (0.85)
          , NULL                  -- Default max iterations (100)
          , 0.00000001            -- Threshold
          , 'user_id'             -- Grouping column name
      );
      SELECT * FROM pagerank_gr_out ORDER BY user_id, pagerank DESC;

      [ Output: PageRank of user id 1; PageRank of user id 2 ]
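The idea behind the grouping column can be sketched outside the database: split the edge rows by `user_id` and run an independent PageRank per group. This is an illustrative approximation under stated assumptions (each group's vertex set is taken from its own edges, and the helper names are hypothetical); MADlib's internal handling may differ.

```python
from collections import defaultdict

# (src, dest, user_id) rows copied from the edge table created earlier.
EDGE_ROWS = [
    (0, 1, 1), (0, 2, 1), (0, 4, 1), (1, 2, 1), (1, 3, 1), (2, 3, 1),
    (2, 5, 1), (2, 6, 1), (3, 0, 1), (4, 0, 1), (5, 6, 1), (6, 3, 1),
    (0, 1, 2), (0, 2, 2), (0, 4, 2), (1, 2, 2), (1, 3, 2), (2, 3, 2),
    (3, 0, 2), (4, 0, 2), (5, 6, 2), (6, 3, 2),
]


def pagerank(edges, d=0.85, max_iter=100, tol=1e-8):
    """Plain iterative PageRank over a list of (src, dest) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        new_rank = {v: (1 - d) / n for v in vertices}
        for src, dest in edges:
            new_rank[dest] += d * rank[src] / out_degree[src]
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


def pagerank_by_group(edge_rows):
    """Run PageRank separately on the edges of each group value."""
    groups = defaultdict(list)
    for src, dest, group in edge_rows:
        groups[group].append((src, dest))
    return {group: pagerank(edges) for group, edges in groups.items()}


results = pagerank_by_group(EDGE_ROWS)
for user_id, ranks in sorted(results.items()):
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    print(user_id, [(v, round(r, 4)) for v, r in ordered])
```

In this data, vertex 5 has no incoming edge for user 2, so its rank there stays at the baseline (1 - d) / N, while for user 1 it also receives rank from vertex 2.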
  • 37. Example: PageRank in MADlib ● Personalized PageRank of vertices {2, 4}, for Recommendations

      DROP TABLE IF EXISTS pagerank_pers_out, pagerank_pers_out_summary;
      SELECT madlib.pagerank(
            'vertex'              -- Vertex table
          , 'id'                  -- Vertex id column
          , 'edge'                -- Edge table
          , 'src=src, dest=dest'  -- Comma-delimited string of edge arguments
          , 'pagerank_pers_out'   -- Output table of PageRank
          , NULL                  -- Default damping factor (0.85)
          , NULL                  -- Default max iterations (100)
          , NULL                  -- Default Threshold (1/number of vertices*1000)
          , NULL                  -- No Grouping
          , '{2, 4}'              -- Personalization vertices
      );
      SELECT * FROM pagerank_pers_out ORDER BY pagerank DESC;
      SELECT * FROM pagerank_pers_out_summary;

      * Personalized PageRank = (1-p)*Ax + p*E, where 'E' is the list of vertices for personalized PageRank, 'p' is the damping factor
  • 38. Example: PageRank in MADlib ● PageRank Performance on Greenplum w/ MADlib ● Greenplum cluster: 1 master; 4 segment hosts with 6 segments per host ● Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case) [ Chart: log-log scale; one axis runs from 1K to 100M, the other from 1s to 1M s; the largest case is annotated "5B edges" ]
  • 39. In Summary ● Capture the Relationship in Networks using Graph Analytics → Community, Structure, Path, Centrality → Combine context-based graph analytics with other content-based insights ● Graph analytics at SCALE with Open Source Software → Apache MADlib on Greenplum, massively parallel processing
  • 40. One more thing... GREENPLUM SUMMIT at PostgresConf 2019 by Pivotal
  • 41. Transforming How The World Builds Software