Copyright © ArangoDB GmbH, 2017 - Confidential
Handling Billions Of Edges in a Graph Database
‣ Michael Hackstein
‣ ArangoDB Core Team
‣ Graph visualisation
‣ Graph features
‣ SmartGraphs
‣ Host of cologne.js
‣ Master’s Degree (spec. Databases and Information Systems)
{ name: "alice", age: 32 }
{ name: "bob", age: 35, size: "1.73 m" }
{ name: "dancing" }
{ name: "reading" }
{ name: "fishing" }
[Figure: the vertices above, linked by four edges labelled "hobby"]
‣ Schema-free Objects (Vertices)
‣ Relations between them (Edges)
‣ Edges have a direction
‣ Edges can be queried in both directions
‣ Easily query a range of edges (2 to 5)
‣ Undefined number of edges (1 to *)
‣ Shortest Path between two vertices
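A minimal sketch of this data model in Python, assuming an in-memory store. The vertex ids, the `neighbors` helper, and the choice of which person has which hobby are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# Vertices are schema-free documents keyed by an id (ids are illustrative).
vertices = {
    "alice": {"name": "alice", "age": 32},
    "bob": {"name": "bob", "age": 35},
    "dancing": {"name": "dancing"},
    "reading": {"name": "reading"},
    "fishing": {"name": "fishing"},
}

# Directed edges as (source, target, label). Who has which hobby is assumed.
edges = [
    ("alice", "dancing", "hobby"),
    ("alice", "reading", "hobby"),
    ("bob", "reading", "hobby"),
    ("bob", "fishing", "hobby"),
]

# Two lookup tables so that edges can be followed in both directions.
outbound = defaultdict(list)
inbound = defaultdict(list)
for src, dst, _label in edges:
    outbound[src].append(dst)
    inbound[dst].append(src)


def neighbors(vertex, direction="outbound"):
    """Return adjacent vertex ids for 'outbound', 'inbound' or 'any'."""
    if direction == "outbound":
        return list(outbound[vertex])
    if direction == "inbound":
        return list(inbound[vertex])
    if direction == "any":
        return list(outbound[vertex]) + list(inbound[vertex])
    raise ValueError(f"unknown direction: {direction}")
```

Calling `neighbors("reading", "inbound")` follows the hobby edges backwards and yields the people who share that hobby.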
‣ Give me all friends of Alice
[Figure: friendship graph with Alice, Bob, Charly, Dave, Eve, and Frank]
‣ Give me all friends-of-friends of Alice
[Figure: friendship graph with Alice, Bob, Charly, Dave, Eve, and Frank]
‣ What is the linking path between Alice and Eve?
[Figure: friendship graph with Alice, Bob, Charly, Dave, Eve, and Frank]
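One way to answer the linking-path question is a breadth-first search, sketched here in Python. The friendship edges in `friends` are hypothetical, since the slide figure did not survive extraction, and `shortest_path` is an illustrative helper rather than a database API.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the predecessor chain back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical undirected friendship graph (each edge listed in both directions).
friends = {
    "Alice": ["Bob", "Charly", "Dave"],
    "Bob": ["Alice", "Eve"],
    "Charly": ["Alice"],
    "Dave": ["Alice", "Frank"],
    "Eve": ["Bob"],
    "Frank": ["Dave"],
}

path = shortest_path(friends, "Alice", "Eve")
```

With these assumed edges, `path` is `["Alice", "Bob", "Eve"]`.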
‣ Which train stations can I reach if I am allowed to travel a distance of at most 6 stations on my ticket?
[Figure: train network map with a "You are here" marker]
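The train-station question is a traversal with a bounded depth. A minimal Python sketch, assuming a hypothetical network of ten stations on a single line (`s0` to `s9`); `reachable_within` is an illustrative helper.

```python
from collections import deque


def reachable_within(adjacency, start, max_hops):
    """Return {vertex: hop count} for all vertices within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the allowed depth
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    del distance[start]  # the start itself is not a result
    return distance


# Hypothetical network: ten stations on one line, s0 - s1 - ... - s9.
line = {f"s{i}": [] for i in range(10)}
for i in range(9):
    line[f"s{i}"].append(f"s{i + 1}")
    line[f"s{i + 1}"].append(f"s{i}")

stations = reachable_within(line, "s0", 6)
```

Starting at `s0` with a limit of 6, the result contains `s1` through `s6` and nothing further down the line.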
‣ Give me all users that share two hobbies with Alice
[Figure: Alice and two Friend vertices connected through Hobby1 and Hobby2]
‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought them and by the product's rating, but only 20 of them.
[Figure: Alice and a Friend linked by is_friend, both linked to Product vertices by has_bought edges]
‣ Give me all users which have an age attribute between 21 and 35.
‣ Give me the age distribution of all users
‣ Group all users by their name
Traversal: Iterate down two edges with some filters
We first pick a start vertex (S)
We collect all edges on S
We apply filters on edges
We iterate down one of the edges to (A)
We apply filters on edges
The next vertex (E) is at the desired depth.
Return the path S -> A -> E
Go back to the next unfinished vertex (B)
We iterate down on (B)
We apply filters on edges
The next vertex (F) is at the desired depth.
Return the path S -> B -> F
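The walk-through above can be sketched as a depth-first traversal in Python. This is an illustration under assumptions, not ArangoDB's implementation: the extra vertices C and D, the `type` attribute, and the filter are hypothetical, and the sketch does not track visited vertices.

```python
def traverse(out_edges, start, depth, edge_filter):
    """Depth-first traversal returning every path with exactly `depth` edges."""
    results = []

    def visit(vertex, path):
        if len(path) - 1 == depth:
            results.append(path)  # vertex is at the desired depth
            return
        for edge in out_edges.get(vertex, ()):  # collect all edges on vertex
            if not edge_filter(edge):  # apply filters on edges
                continue
            visit(edge["to"], path + [edge["to"]])  # iterate down the edge

    visit(start, [start])
    return results


# Hypothetical graph: S has three edges, one of which is filtered out.
out_edges = {
    "S": [
        {"to": "A", "type": "friend"},
        {"to": "B", "type": "friend"},
        {"to": "C", "type": "blocked"},
    ],
    "A": [{"to": "E", "type": "friend"}, {"to": "D", "type": "blocked"}],
    "B": [{"to": "F", "type": "friend"}],
}

paths = traverse(out_edges, "S", 2, lambda edge: edge["type"] == "friend")
```

With this input, `paths` is `[["S", "A", "E"], ["S", "B", "F"]]`, in the same order as the slides: first the branch through A, then back to the unfinished vertex B.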
Traversal: Complexity
                    Operation                      Comment                       O
Once:               Find the start vertex          Depends on indexes: Hash      1
For every depth:    Find all connected edges       Edge-Index or Index-Free      1
                    Filter non-matching edges      Linear in edges               n
                    Find connected vertices        Depends on indexes: Hash      n · 1
                    Filter non-matching vertices   Linear in vertices            n
Total for one pass:                                                              3n
Traversal: Complexity
Linear sounds evil?
NOT linear in all edges: O(E)
Only linear in relevant edges: n < E
Traversals scale solely with their result size
They are not affected at all by the total amount of data
BUT: Every depth increases the exponent: O((3n)^d)
“7 degrees of separation”: n^6 < E < n^7
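A small Python experiment can illustrate both claims: the work depends only on the edges reachable from the start vertex, and it grows exponentially with depth. The test graph (a complete tree with 3 children per vertex) and the 100,000 unrelated edges are assumptions chosen for the example.

```python
def count_inspected_edges(out_edges, start, depth):
    """Count how many edges a depth-limited traversal looks at."""
    inspected = 0
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for vertex in frontier:
            for target in out_edges.get(vertex, ()):
                inspected += 1
                next_frontier.append(target)
        frontier = next_frontier
    return inspected


def tree(branching, depth, root="r"):
    """Hypothetical test graph: complete tree with `branching` children per vertex."""
    out_edges = {}
    level = [root]
    for _ in range(depth):
        next_level = []
        for vertex in level:
            children = [f"{vertex}.{i}" for i in range(branching)]
            out_edges[vertex] = children
            next_level.extend(children)
        level = next_level
    return out_edges


small = tree(branching=3, depth=2)
big = dict(small)
# Add 100,000 edges that are not reachable from the start vertex "r".
big.update({f"x{i}": [f"x{i + 1}"] for i in range(100_000)})

same_work = count_inspected_edges(small, "r", 2) == count_inspected_edges(big, "r", 2)
growth = [count_inspected_edges(tree(3, 4), "r", d) for d in range(1, 5)]
```

Here `same_work` is `True` (12 edges inspected in both graphs), while `growth` is `[3, 12, 39, 120]`: each extra level multiplies the work by roughly the branching factor.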
‣ MULTI-MODEL database
‣ Stores Key Value, Documents, and Graphs
‣ All in one core
‣ Query language AQL
‣ Document Queries
‣ Graph Queries
‣ Joins
‣ All can be combined in the same statement
‣ ACID support including Multi Collection Transactions
FOR user IN users
RETURN user
FOR user IN users
FILTER user.name == "alice"
RETURN user
FOR user IN users
FILTER user.name == "alice"
FOR product IN OUTBOUND user has_bought
RETURN product
[Figure: Alice -has_bought-> TV]
FOR user IN users
FILTER user.name == "alice"
FOR recommendation, action, path IN 3 ANY user has_bought
FILTER path.vertices[2].age <= user.age + 5
AND path.vertices[2].age >= user.age - 5
FILTER recommendation.price < 25
LIMIT 10
RETURN recommendation
[Figure: Alice -has_bought-> TV <-has_bought- Bob -has_bought-> Playstation, annotated with the conditions alice.age - 5 <= bob.age && bob.age <= alice.age + 5 and playstation.price < 25]
‣ Many graphs have "celebrities"
‣ Vertices with many inbound and/or outbound edges
‣ Traversing over them is expensive (linear in number of Edges)
‣ Often you only need a subset of edges
‣ Remember Complexity? O(3 * n^d)
‣ Filtering of non-matching edges is linear for every depth
‣ Index all edges based on their vertices and arbitrary other attributes
‣ Find initial set of edges in identical time
‣ Less / No post-filtering required
‣ This decreases the n significantly
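The idea of indexing edges by vertex plus another attribute can be sketched in Python as follows. This is an illustrative in-memory structure, not ArangoDB's implementation: the class name `VertexCentricIndex`, the `since` attribute, and the celebrity data are assumptions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class VertexCentricIndex:
    """Edges grouped per source vertex and kept sorted by one edge attribute."""

    def __init__(self, edges, attribute):
        grouped = defaultdict(list)
        for edge in edges:
            grouped[edge["from"]].append((edge[attribute], edge["to"]))
        self._keys = {}
        self._targets = {}
        for vertex, entries in grouped.items():
            entries.sort()
            self._keys[vertex] = [key for key, _ in entries]
            self._targets[vertex] = [target for _, target in entries]

    def lookup(self, vertex, low, high):
        """Targets of edges from `vertex` whose attribute lies in [low, high]."""
        keys = self._keys.get(vertex, [])
        targets = self._targets.get(vertex, [])
        # Binary search finds the matching range without scanning all edges.
        return targets[bisect_left(keys, low):bisect_right(keys, high)]


# Hypothetical celebrity "bob" with 1000 outbound edges spread over 20 years.
edges = [
    {"from": "bob", "to": f"fan{i}", "since": 2000 + i % 20}
    for i in range(1000)
]
index = VertexCentricIndex(edges, "since")
recent = index.lookup("bob", 2019, 2019)
```

Only the 50 edges with `since == 2019` are returned. The other 950 edges of the celebrity are never touched, which is the reduction of n the slide refers to.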
‣ We have the rise of big data
‣ Store everything you can
‣ Dataset easily grows beyond one machine
‣ This includes graph data!
Scaling horizontally
Distribute graph on several machines (sharding)
How to query it now?
No global view of the graph possible any more
What about edges between servers?
In a sharded environment the network is the bottleneck most of the time
Reduce network hops
Vertex-Centric Indexes again help with super-nodes
But: Only on a local machine
Random distribution
Advantages:
every server takes an equal portion of the data
easy to realize
no knowledge about data required
always works
Disadvantages:
Neighbors on different machines
Probably edges on other machines than their vertices
A lot of network overhead is required for querying
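A short Python sketch of hash-based (random) placement and its cost. The ring graph, the shard count of 4, and the helper names are assumptions for illustration. With 4 shards, two independently placed neighbors land on different shards with probability 3/4, so most edges are expected to cross machines.

```python
import hashlib


def shard_of(key, shard_count):
    """Deterministic pseudo-random shard for a vertex key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def cross_shard_edges(edges, placement):
    """Number of edges whose two endpoints live on different shards."""
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


# Hypothetical graph: 1000 vertices connected in a ring.
vertices = [f"v{i}" for i in range(1000)]
edges = [(f"v{i}", f"v{(i + 1) % 1000}") for i in range(1000)]

placement = {vertex: shard_of(vertex, 4) for vertex in vertices}
crossing = cross_shard_edges(edges, placement)
```

No knowledge about the data is needed to compute `placement`, but `crossing` ends up being the majority of the 1000 edges, and each of them costs a network hop during a traversal.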
Domain-based Distribution
Many Graphs have a natural distribution
By country/region for People
By tags for Blogs
By category for Products
Most edges in the same group
Rare edges between groups
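A Python sketch of domain-based placement, assuming a hypothetical graph of people in three regions with only two edges between regions. The region names, the shard mapping, and the helper names are illustrative.

```python
def shard_by_group(vertex_groups, group_to_shard):
    """Place every vertex on the shard assigned to its group."""
    return {
        vertex: group_to_shard[group]
        for vertex, group in vertex_groups.items()
    }


def cross_shard_edges(edges, placement):
    """Number of edges whose two endpoints live on different shards."""
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


# Hypothetical data: 100 people per region, connected in a ring per region.
regions = ["eu", "us", "asia"]
vertex_groups = {f"{r}{i}": r for r in regions for i in range(100)}
edges = [(f"{r}{i}", f"{r}{(i + 1) % 100}") for r in regions for i in range(100)]
# Rare edges between groups.
edges += [("eu0", "us0"), ("us0", "asia0")]

placement = shard_by_group(vertex_groups, {"eu": 0, "us": 1, "asia": 2})
crossing = cross_shard_edges(edges, placement)
```

Of the 302 edges, only the 2 inter-region edges cross shards, so a traversal that stays within one region needs no network hop at all.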
Smart graphs: this is how it works
‣ ArangoDB uses a hash-based edge index (O(1) lookup)
‣ The vertex is independent of its edges
‣ It can be stored on a different machine
‣ Index-free adjacency is used by most other graph databases
‣ Every vertex maintains two lists of its edges (IN and OUT)
‣ They do not use an index to find edges
‣ How to shard this?
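The two storage designs can be contrasted with a small Python sketch. Both classes are simplified illustrations with assumed names. Neither reflects the actual internals of ArangoDB or of any other database.

```python
from collections import defaultdict


class EdgeIndex:
    """Hash-based edge index: edges are found by vertex id, not via the vertex."""

    def __init__(self):
        self._out = defaultdict(list)
        self._in = defaultdict(list)

    def add(self, src, dst):
        self._out[src].append(dst)
        self._in[dst].append(src)

    def outbound(self, vertex_id):
        return self._out.get(vertex_id, [])

    def inbound(self, vertex_id):
        return self._in.get(vertex_id, [])


class AdjacencyVertex:
    """Index-free adjacency: each vertex carries its own IN and OUT lists."""

    def __init__(self, name):
        self.name = name
        self.out_edges = []
        self.in_edges = []


def connect(src, dst):
    """Link two AdjacencyVertex objects by direct references."""
    src.out_edges.append(dst)
    dst.in_edges.append(src)


# Edge index: only ids are stored, so the vertex documents can live elsewhere.
edge_index = EdgeIndex()
edge_index.add("alice", "bob")

# Index-free adjacency: the edge lists hold references to the vertex objects.
alice, bob = AdjacencyVertex("alice"), AdjacencyVertex("bob")
connect(alice, bob)
```

In the first design the index holds only vertex ids, so a vertex and its edges can be placed on different machines. In the second, the edge lists point directly at other vertex objects, which is what makes the sharding question on the slide hard.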
‣ Further questions?
‣ Follow us on twitter: @arangodb
‣ Join our slack: slack.arangodb.com
‣ Follow me on twitter/github: @mchacki
Thank You

Scaling to billions of Edges in a Graph Database by Max Neunhoeffer at Big Data Spain 2017
