Exalead managing terabytes

Content

• Introduction
• Databases
  – ACID
  – Data structures, algorithms
  – Scalability issues
  – Scaling patterns
• Search engines
  – Data structures, algorithms
  – Pros & cons
• NoSQL Movement
  – Why and What

Content

• NoSQL Families
  – Key value stores
  – Column stores
  – Document stores
  – Graph DB
• Principles: CAP, Scaling patterns, High availability patterns, Elasticity
• How to choose?
• Conclusion

Introduction

• Who we are:
  – Clément STENAC (Indexing and search techs)
  – Jérémie BORDIER (360 team (a bit of everything))
• Exalead:
  – Indexing technologies provider since 1998
  – Online search engine: http://www.exalead.com
  – Daily challenge: Tackle information access problems for large companies.

Introduction

• Universal answer to data storage: RELATIONAL DATABASES
• Well-known data representation: Objects and relationships
• Powerful query language: SQL
• Open source implementations:
  – MySQL
  – PostgreSQL
  – …

Introduction

• Database scalability problems?
• Used to be a Telco and bank problem…
• Until the internet came!

Twitter whale, 2008

Introduction

• Thanks to the internet…
• …millions of rows are frequent…
• …real time websites.

How to deal with massive amounts of structured data?
Are there alternatives?
What’s this NoSQL buzz?

Knowing your enemy:
RELATIONAL DATABASES

Databases: ACID

ACID constraints

• Atomicity
  • Transactions succeed or fail atomically
• Consistency
  • Transactions leave the database in a consistent state
• Isolation
  • Transactions do not see the effects of concurrent transactions
• Durability
  • Once a transaction is committed, it can’t be lost

Database structures

Primary storage

CREATE TABLE author (
    id INTEGER PRIMARY KEY,
    nick VARCHAR(16),        -- fixed size
    age INTEGER,
    firstname VARCHAR(128),
    biography TEXT           -- variable size
);
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES author(id),
    timestamp TIMESTAMP,
    title VARCHAR(256),
    text TEXT
);

• Heuristics can change a fixed-size column to variable-size storage (in the layout below, firstname is stored through a pointer)
• Each value or pointer can be retrieved at a known offset in the row

        Id       age      nick      firstname  biography
Row 1   4 bytes  4 bytes  16 bytes  pointer    pointer
Row 2   4 bytes  4 bytes  16 bytes  pointer    pointer

Table strings: len data | len data | len data | len data

Searching in a database

SELECT * FROM author WHERE age=24;

The raw way: full scan

• Enumerate all records in the table
• For each record, fetch the condition value
  • Inline value: direct access at row_address + offset(column)
  • Outside value: fetch pointer and fetch data
• Perform comparison

Analysis

• Need to analyse the full table
• Very CPU intensive
• If the table does not fit in memory? – I/O on the whole table
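
The full scan described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout (id, age, nick packed into 24 bytes), the names `ROW`, `AGE_OFFSET`, `build_table` and `full_scan`, and the sample rows are all hypothetical and not taken from any real database engine.

```python
import struct

# Hypothetical fixed-size row layout: id (4 bytes), age (4 bytes), nick (16 bytes).
ROW = struct.Struct("<ii16s")
AGE_OFFSET = 4  # offset(age) inside a row


def build_table(rows):
    """Pack (id, age, nick) tuples into one contiguous bytes buffer."""
    return b"".join(ROW.pack(i, age, nick.encode()) for i, age, nick in rows)


def full_scan(table, wanted_age):
    """Return the ids of all rows whose age equals wanted_age."""
    hits = []
    # Enumerate all records in the table.
    for row_address in range(0, len(table), ROW.size):
        # Inline value: direct access at row_address + offset(column).
        (age,) = struct.unpack_from("<i", table, row_address + AGE_OFFSET)
        # Perform the comparison.
        if age == wanted_age:
            (row_id,) = struct.unpack_from("<i", table, row_address)
            hits.append(row_id)
    return hits


table = build_table([(1, 24, "alice"), (2, 31, "bob"), (3, 24, "carol")])
print(full_scan(table, 24))
```

Every row is visited even when few rows match, which is why the cost grows with the table size rather than with the result size.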

Database structures

Indexes

What is an index?

• Primary storage: forward mapping
  row_id –> row data
• Index: reverse mapping
  row data –> row_id(s)
• Updated together with the primary storage

Searching with an index

• Retrieve the row ids using the index
• Fetch the row data from primary storage

Database structures

Indexes – Hash index

How it works

• Stores hashes of column values in a hash table
• Retrieve through the hash table

Pros

• Very easy and fast to update
• Fast lookup – single hashtable lookup

Cons

• Only provides equality matching
• Unable to answer inequality queries
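
As a rough illustration of a hash index, a Python `dict` can play the role of the hash table. The row contents and the helper names `build_hash_index` and `lookup` are assumptions made for this sketch, not part of any database API.

```python
from collections import defaultdict

# Primary storage: forward mapping row_id -> row data (illustrative rows).
rows = {
    1: {"nick": "alice", "age": 24},
    2: {"nick": "bob", "age": 31},
    3: {"nick": "carol", "age": 24},
}


def build_hash_index(rows, column):
    """Reverse mapping: column value -> list of row ids."""
    index = defaultdict(list)
    for row_id, row in rows.items():
        index[row[column]].append(row_id)
    return index


def lookup(index, rows, value):
    """Equality search: one hash lookup, then fetch rows from primary storage."""
    return [rows[row_id] for row_id in index.get(value, [])]


age_index = build_hash_index(rows, "age")
print(lookup(age_index, rows, 24))
```

Note that the structure gives no help for a condition such as `age < 30`: the keys are not ordered, so the only option would be to visit every entry.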

Database structures

Indexes – BTree index

[Figure: Binary search tree and B-Tree]

Pros

• Provides range and inequality queries easily
• Quite fast (logarithmic) operations

Cons

• More complex and expensive to update
  • B-Tree rebalancing
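
To see why an ordered index answers range queries, here is a small sketch that uses a sorted list and binary search as a stand-in for the ordered leaves of a B-Tree. It is not a B-Tree implementation (no nodes, no rebalancing); the `entries` data and the `range_query` name are hypothetical.

```python
import bisect

# Sorted (value, row_id) pairs stand in for the ordered leaves of a B-Tree.
entries = sorted([(24, 1), (31, 2), (24, 3), (42, 4), (18, 5)])


def range_query(entries, low, high):
    """Row ids whose indexed value v satisfies low <= v < high."""
    # (low,) sorts before every (low, row_id), so equal values are included.
    start = bisect.bisect_left(entries, (low,))
    end = bisect.bisect_left(entries, (high,))
    return [row_id for _, row_id in entries[start:end]]


print(range_query(entries, 20, 40))
```

Both bounds are found in logarithmic time, and the matching entries are contiguous, which is the property a hash index lacks.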

Choosing how to search

Is indexed search always better?

• SELECT * from author where age < 300;

Analysis

• Fetch of whole table
• Index: random lookups
• Full scan: sequential fetch

Choosing wisely

• Identify the expensive queries
  • Use the EXPLAIN statement
• Only add indexes where they are required
  • Indexes are expensive to update

Joining

Goal

• Put together data from several tables
• For some values in table A, find matching values in table B

Example

• SELECT * FROM post
  INNER JOIN author
  ON author.id = post.author_id
  WHERE author.age = 42;

Join algorithms

Nested loops

• Foreach (author WHERE age=42) {
    Foreach(post) {
      if (post.author_id == author.id) {
        append post to the result set;
      }
    }
  }
• Very naive algorithm: runs in P×A time (P posts, A matching authors)
• Supports all join predicates

Hash join

• Algorithm
  • Make a hashtable of author ids matching the « age = 42 » condition
  • Scan the post table once
  • For each post, look up in the hashtable to check if it matches a valid author
• Faster than nested loops (2 scans instead of A)
• Requires memory to store the hashtable
• Only provides equality predicate
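
The two algorithms can be compared side by side in a short sketch. The in-memory `authors` and `posts` lists and the function names are assumptions for illustration; a real engine works on pages and rows rather than Python dicts.

```python
authors = [
    {"id": 1, "nick": "alice", "age": 42},
    {"id": 2, "nick": "bob", "age": 31},
    {"id": 3, "nick": "carol", "age": 42},
]
posts = [
    {"id": 10, "author_id": 1, "title": "a"},
    {"id": 11, "author_id": 2, "title": "b"},
    {"id": 12, "author_id": 3, "title": "c"},
    {"id": 13, "author_id": 1, "title": "d"},
]


def nested_loops_join(authors, posts, age):
    """One full scan of post for every matching author."""
    result = []
    for author in authors:
        if author["age"] != age:
            continue
        for post in posts:
            if post["author_id"] == author["id"]:
                result.append(post["id"])
    return result


def hash_join(authors, posts, age):
    """Build a hashtable of valid author ids, then scan post once."""
    valid = {a["id"] for a in authors if a["age"] == age}  # build phase
    return [p["id"] for p in posts if p["author_id"] in valid]  # probe phase


print(nested_loops_join(authors, posts, 42))
print(hash_join(authors, posts, 42))
```

Both functions return the same set of posts, only in a different order: the nested loops version groups results by author, the hash join follows the order of the post table.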

Join algorithms

Merge join

• Need to have both tables sorted by join key
  • Post sorted by author_id
  • Author sorted by id
• Perform a single parallel scan of the two tables and identify matches
• Fastest algorithm, but needs sorted data
  • Disk-based sort for large data sets
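
A minimal merge join sketch, assuming both inputs are already sorted by the join key and that author ids are unique. The tuple shapes and the `merge_join` name are hypothetical.

```python
def merge_join(authors, posts):
    """Join two lists that are already sorted by join key.

    authors: list of (author_id, nick), sorted by author_id, ids unique.
    posts:   list of (author_id, title), sorted by author_id.
    """
    result = []
    i = j = 0
    while i < len(authors) and j < len(posts):
        author_id, nick = authors[i]
        post_author_id, title = posts[j]
        if author_id < post_author_id:
            i += 1  # no more posts for this author
        elif author_id > post_author_id:
            j += 1  # this post has no matching author
        else:
            result.append((nick, title))
            j += 1  # stay on the author: it may have more posts
    return result


authors = [(1, "alice"), (2, "bob"), (3, "carol")]
posts = [(1, "a"), (1, "d"), (3, "c"), (4, "orphan")]
print(merge_join(authors, posts))
```

Each list is read once from start to end, so the cost is linear in the combined size once the sort has been paid for.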

Choice of join algorithm

• Performed automatically by the query optimizer (EXPLAIN)
• Main parameters:
  • Relations cardinalities
  • Data order (presence of an ORDER BY clause?)
  • Available indexes
• JOINs are always expensive -> schema denormalization

Database scaling

Typical workloads

Mostly read workloads

• Example: Wikipedia
• First solution: high-level (frontend tier) caching
• Database scaling: 1 master – N slaves
  • Replication of changes from master to slaves
  • Does not solve the write bottleneck problem

High write workloads

• Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
• Performance limited by write I/O throughput
  • Because of the « D » constraint
  • Hard to have more than 1000-2000 writes/second

Database scaling

Scaling writes

Multiple master setups

• All masters have the same data and share the updates
  • « share-all » cluster architecture
• Extremely complex synchronization
  • Bi-directional replication
  • Conflict detection
  • Bad performance
• Complex resilience
  • Downtime of a master: need a resync
• Complex, heavy and expensive architectures

[Diagram: Client 1 talks to Master 1, Client 2 talks to Master 2, with a bi-directional replication flow between Master 1 and Master 2]

Database scaling

Scaling writes

Sharding

• Split the data between the masters based on a criterion
  • Date
  • User id
  • hash(url), …
• Clients query the correct master for each piece of data
• No shared data between masters (« share-nothing »)

[Diagram: Client 1 and Client 2 each query Master 1 or Master 2 directly; no link between the masters]

Database scaling

Problems with SQL sharding

Complexity

• Not integrated in SQL
• Need to perform the sharding in applicative code

Resilience

• Several machines but no resilience
• Loss of one master = loss of data (compare to RAID-0)

Loss of features

• You can’t do cross-shard joins

Complex evolutions

• How do you keep scaling?
• To add another machine, you need to change the distribution function
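
The cost of changing the distribution function can be made concrete with a small sketch. It assumes the common `hash(key) mod N` criterion; the `shard_for` name, the use of MD5 as a stable hash, and the example URLs are all assumptions made for illustration.

```python
import hashlib


def shard_for(key, n_masters):
    """Pick a master from a stable hash of the key (the hash(url) criterion)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_masters


urls = [f"http://example.com/page/{i}" for i in range(1000)]

# Going from 2 to 3 masters changes the function from "mod 2" to "mod 3".
moved = sum(shard_for(u, 2) != shard_for(u, 3) for u in urls)
print(f"{moved} of {len(urls)} keys change master when going from 2 to 3")
```

With a uniform hash, a key keeps its master only when `h mod 2 == h mod 3`, which happens for about one third of the keys, so roughly two thirds of the data would have to be moved to add a single machine.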

Database scaling

Other SQL shortcomings

Strict schema

• It is good, it provides strong typing
• But, migration hell!
  • Web applications change quickly
  • Not « Agile »

On the other side:
SEARCH ENGINES

A quick look at search engines

Differences from a traditional database

• Not designed for OLTP
  • Update by batches
  • No transactions, updates are available to readers « later »
• Heavily read-optimized

Full text search

• It’s more complex than LIKE ’%myword%’;
• Need specific data structures

Search engines

Inverted lists

What it is

• A data structure mapping a « word identifier » to a list of « document identifiers »
• For each word of each document, store the positions

Documents

• Document 1: The quick fox
• Document 2: The lazy dog
• Document 3: The dog quick dog

Dictionary

• the = 1
• quick = 2
• fox = 3
• lazy = 4
• dog = 5

Inverted lists

• List for word 1 (the)
  • doc 1 (at position 0)
  • doc 2 (at position 0)
  • doc 3 (at position 0)
• List for word 2 (quick)
  • doc 1 (at position 1)
  • doc 3 (at position 2)
• List for word 3 (fox)
  • doc 1 (at position 2)
• List for word 4 (lazy)
  • doc 2 (at position 1)
• List for word 5 (dog)
  • doc 2 (at position 2)
  • doc 3 (at positions 1, 3)
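
The slide's example can be rebuilt with a short Python sketch. The tokenization (lowercase, split on whitespace) and the names `build_inverted_index`, `dictionary` and `lists` are assumptions; real engines use far more compact encodings.

```python
from collections import defaultdict

documents = {
    1: "The quick fox",
    2: "The lazy dog",
    3: "The dog quick dog",
}


def build_inverted_index(documents):
    """Return (dictionary, lists).

    dictionary: word -> word id
    lists:      word id -> {doc id: [positions]}
    """
    dictionary = {}
    lists = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            # Assign the next free id the first time a word is seen.
            word_id = dictionary.setdefault(word, len(dictionary) + 1)
            lists[word_id].setdefault(doc_id, []).append(position)
    return dictionary, lists


dictionary, lists = build_inverted_index(documents)
print(dictionary)
print(lists[dictionary["dog"]])
```

With these three documents the word ids and position lists come out the same as on the slide, for instance « dog » gets id 5 and appears in doc 2 at position 2 and in doc 3 at positions 1 and 3.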

Exalead S.A. © 2010
CONFIDENTIAL

Search engines

Searching with inverted lists

Single word query: dog

• Resolve the word to its id using the dictionary (wid 5)
• Fetch the inverted list for this id
  • Simply read the inverted list for its id
• We have the hits: document 2 and document 3

Boolean query: the AND dog

• Resolve words, fetch inverted lists
  • The: 1,2,3 Dog: 2,3
• Perform intersection: hits = 2,3

Boolean query: the OR dog

• Resolve/fetch
• Perform union: hits = 1, 2, 3
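
Boolean queries reduce to set operations on sorted lists of document ids. The sketch below uses the lists from the slide; the function names `intersect` and `union` are hypothetical, and the linear merge is one common way to intersect sorted lists, not necessarily the one used by any particular engine.

```python
def intersect(a, b):
    """AND: doc ids present in both sorted lists (linear merge)."""
    hits, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits


def union(a, b):
    """OR: doc ids present in either list, returned sorted."""
    return sorted(set(a) | set(b))


the, dog = [1, 2, 3], [2, 3]
print(intersect(the, dog))  # the AND dog
print(union(the, dog))      # the OR dog
```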

Search engines

Searching with inverted lists

Positional query: the NEXT dog

• Fetch the inverted lists and also read the positions
  • The : 1(0), 2(0), 3(0)
    Dog : 2(2), 3(1,3)
• Identify “simple boolean” matches: docs 2 and 3
• For each possible match, check if positions form a sequence
  • Only document 3 matches on sequence (0,1)
• Positional queries are more expensive and storing word positions is expensive (disk space, decoding CPU, I/O)
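
The positional check can be sketched as follows, using the position lists from the slide. The `next_query` name and the dict-of-lists representation are assumptions for illustration.

```python
def next_query(first, second):
    """Docs where a position of `second` directly follows a position of `first`.

    Both arguments map doc id -> list of word positions.
    """
    hits = []
    # Step 1: "simple boolean" match, docs that contain both words.
    for doc_id in sorted(first.keys() & second.keys()):
        following = set(second[doc_id])
        # Step 2: check whether the positions form a sequence (p, p + 1).
        if any(pos + 1 in following for pos in first[doc_id]):
            hits.append(doc_id)
    return hits


the = {1: [0], 2: [0], 3: [0]}
dog = {2: [2], 3: [1, 3]}
print(next_query(the, dog))
```

Document 2 passes the boolean step but fails the positional one (positions 0 and 2 are not adjacent), so only document 3 is returned.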

The revolution:
THE NOSQL MOVEMENT

NoSQL Movement

• « NoSQL » © Eric EVANS (Rackspace, 2009)

The name was an attempt to describe the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID guarantees.
Wikipedia

NoSQL Movement: Issue

• RDBMS fails with huge amounts of data
  – Facebook’s 70TB of inbox
  – Digg’s 3TB
  – eBay’s 2PB…
• High scale SQL systems are either:
  – Very expensive to buy and quite to maintain
  – Very expensive to maintain

NoSQL Movement

• We need new systems that:
  – Scale horizontally (both read/write)
  – Have no single point of failure
  – Are fault tolerant
  – Are elastic (adding nodes is easy)
  – Have flexible data schemas
  – Are more web application friendly

NoSQL: Families

• Different types of data stores:
  – Key-Value stores (Dynamo, Redis, Voldemort…)
  – Column stores (BigTable, Cassandra, HBase…)
  – Document stores (CouchDB, MongoDB…)
  – Graph stores (Neo4J, Swarm…)

NoSQL: Key-Value stores

• Distributed hashtables
  – Btrees
  – Fixed sized tables
• Benefits:
  – Very simple API (get/put/delete/range)
  – Easily shardable
  – Fast reads
• Drawbacks:
  – No data schema (no joins, data flattening…)
  – No query language
• Implems: Redis, Amazon Dynamo, Voldemort

NoSQL: Column Stores

Id  Lastname  Firstname  Salary
1   Smith     Joe        40000
2   Jones     Mary       50000
3   Johnson   Cathy      44000

• Row based storage:
  – 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
• Column based storage:
  – 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
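
The two layouts can be reproduced from the same table with a few lines of Python. The function names and the string serialization are assumptions that simply mirror the slide's notation; real column stores use binary, often compressed, formats.

```python
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def row_layout(rows):
    """Serialize record after record."""
    return "".join(",".join(map(str, row)) + ";" for row in rows)


def column_layout(rows):
    """Serialize column after column (zip(*rows) transposes the table)."""
    return "".join(",".join(map(str, col)) + ";" for col in zip(*rows))


print(row_layout(rows))
print(column_layout(rows))

# An aggregate over one column only needs that column's contiguous values.
salaries = list(zip(*rows))[3]
print(sum(salaries))
```

In the column layout the salaries sit next to each other, so an aggregate reads one contiguous run instead of skipping over names in every record.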

NoSQL: Column Stores

• Benefits:
  – Reading all the values of a given column is faster (ex: aggregates)
  – Batch writes are faster
• Joins are faster
  – Comparing two columns is sequential
  – Many more L1 CPU cache hits
  – L1 cache reference: 0.5ns
  – L2 cache reference: 7ns

NoSQL: Column Stores

• Drawbacks:
  – Reading a single object is slower (multiple I/Os)
  – Writing a single object is slower (multiple I/Os)
  – Doesn’t fit most applications
• Finally:
  – Well suited for heavy write / read applications
    • (eg: Facebook inbox indexes)

NoSQL: Document Stores

• Can be seen as schema free, hierarchical database (usually represented as JSON)

SQL Schema:

Person:        1 – N    Animal:
- id                    - id
- name                  - person_id
- address               - name
- phone                 - address
                        - phone

Document store:

Person:
- id
- name
- address
- phone
- animals =
    - id
    - person_id
    - name
    - address
    - phone

NoSQL: Document Stores

• Benefits:
  – Data locality! Everything in one place
  – Efficient write and updates (in place)
  – Efficient read
  – Highly flexible data schema
  – Usually provides indexes over each object key to have a powerful query language
• Drawbacks
  – Doesn’t encourage well designed data schema

NoSQL: Graph Stores

• An entry is a node
• Nodes have properties
• Edges are links between nodes

NoSQL: Graph Stores

• Benefits:
  – Faster to fetch an entry and its related entries (links are already resolved, no need to join)
  – Flexible data schema
• Drawbacks:
  – Complex APIs
  – Slow for batch operations
  – Open source implems are not that good…

The real issues…
SCALABILITY IN PRACTICE

CAP Theorem

• CAP:
  – Consistency: Operating fully or not at all.
  – Availability: The service must be reachable at any time.
  – Partition Tolerance: No set of failures less than total network failure is allowed to cause the system to respond incorrectly.

Any shared-data system can only achieve two of these three.
CAP Theorem, Dr. Eric Brewer, Berkeley (2000)

Consistent Hashing

• Ensuring data availability: replication!
• Reaching the right nodes? Hashing
• Consistent hashing: Hash ring
  – Objects are mapped into a range
  – Nodes are mapped into that range
  – We write the object into the nearest node, clockwise
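
A hash ring can be sketched with a sorted list of node positions and a binary search. This is a minimal illustration under assumptions: the `HashRing` class, the use of MD5 truncated to 32 bits, and the node names are all hypothetical, and replication and virtual nodes are left out.

```python
import bisect
import hashlib


def ring_position(key):
    """Map any string onto the ring [0, 2**32)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes):
        # Nodes are mapped into the same range as the objects.
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def node_for(self, key):
        """Nearest node clockwise from the key's position, wrapping around."""
        positions = [p for p, _ in self._ring]
        index = bisect.bisect_left(positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"object-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} objects moved, all to the new node")
```

Unlike `hash mod N`, adding a node only takes over the objects that fall on its arc of the ring; every other object keeps its previous node.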

Data consistency

• Ensuring data eventual consistency: Quorum writes
  – W = number of writes to ensure before returning OK
  – R = number of reads to ensure
  – N = replication factor
• W < N == High write availability
  – Data may be lost or outdated if read from another node
• R < N == High read availability
  – Data may be outdated
• W + R > N == Full consistency!
  – But slower writes / reads
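
The reason W + R > N gives consistent reads is that any set of W written replicas must share at least one replica with any set of R read replicas. The brute-force check below illustrates this for small N; the `always_overlaps` name is hypothetical.

```python
from itertools import combinations


def always_overlaps(n, w, r):
    """True if every set of w written replicas intersects every set of r read replicas."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


for w, r in [(1, 1), (2, 1), (2, 2), (3, 1), (1, 3)]:
    print(f"N=3 W={w} R={r}: overlap guaranteed = {always_overlaps(3, w, r)}")
```

For N = 3, the common choice W = 2, R = 2 guarantees that a read sees at least one replica holding the latest acknowledged write, while W = 1, R = 1 does not.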

Conflict resolution

• What happens when R > 1 and two different versions are found?
• Conflict resolution!
• Common algorithm: Vector clocks

Vector clocks

• Assign to each node a unique ID
• A node increments its own counter in the vector and keeps track of the old entries
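
A vector clock can be sketched as a mapping from node ID to counter. The helper names `increment`, `descends` and `compare`, and the node IDs "A", "B", "C", are assumptions for illustration; production systems add pruning and timestamps on top of this.

```python
def increment(clock, node_id):
    """Return a copy of the clock with node_id's counter advanced by one."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"


v1 = increment({}, "A")    # written through node A
v2 = increment(v1, "B")    # then updated through node B
v3 = increment(v1, "C")    # concurrently updated through node C

print(compare(v2, v1))     # v2 descends from v1
print(compare(v2, v3))     # neither descends from the other
```

When neither clock descends from the other, the two versions were written concurrently and the store has to hand both to a conflict resolution step.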

Elasticity: Gossip Membership

• When a node joins…

Elasticity: Gossip Membership

• When a node crashes!

I’m starting the next big startup…
WHAT’S THE BEST SYSTEM?

Choosing your storage system

• “Don’t optimize too early”
• MySQL is robust and works VERY well
  – You’ll know where bugs come from (you)
• Key-Value stores are hype, and often badly implemented
• Anyway, most mature “NoSQL” systems:
  – MongoDB
  – Cassandra
