Why databases cry at night

Michael Yarichuk
Hibernating Rhinos

Magic?
Query

Databases are abstractions

The Law of Leaky Abstractions
"All non-trivial abstractions, to some degree, are leaky."
- Joel Spolsky

Many shades of grey databases
RDBMS        Key/Value Store   Document Database   Graph Database
MS-SQL       LMDB              RavenDB             Neo4j
MySQL        BerkeleyDB        MongoDB             OrientDB*
Oracle       Cassandra*        CouchDB             ArangoDB
PostgreSQL   Dynamo            CosmosDB            JanusGraph

Very different databases have
the same reasons for crying.

Things we will take a look at
Storage algorithms & Storage
Indexing & Queries
Network

Let’s start with something simple:
Storage.
It just works, no?
Famous last words!

Before we discuss storage, here is a riddle…
RavenDB server-wide backup failed
• The server hosted multiple databases in a single instance
• Plenty of memory and cores, and resource usage was low
• Nothing else was running on the machine EXCEPT RavenDB
• Scheduled backup tasks failed soon after they started

The backup tasks started at the
same time!

Also, there were a gazillion databases
[Diagram: Database A, Database B, Database C, ... Database FF, all on the same SAN storage]

RavenDB’s failing backups
Approx. 200 databases doing backups at the same time WILL
cause storage saturation!

The solution was rather simple

Disk queue length can be an… issue
[Diagram: pending writes of 16 KB, 8 KB, 10 KB, 16 KB, 12 KB and 12 KB lined up in front of a single disk write; labels: Disk Queue Length, Disk Write]

Disk queue depth
[Diagram: Thread 1 and Thread 2 each submit their own stream of pending writes (8 KB to 16 KB each) to the same disk write; label: Disk Queue Depth = 2]

What can we do about storage issues?
• Load-test database code to ensure
  • Write-through throughput
  • Enough IOPS for expected production load (disk queue length is <= 2)
  • Cloud: provision IOPS
• Load-test the application to find the limits of the system
• Monitoring! (queues that are too long = storage bottleneck)

Storage performance benchmarks
• Sysinternals Process Monitor
• CrystalDiskMark
• ATTO Disk Benchmark
• (Many) other tools

CrystalDiskMark
• Random/sequential I/O?
• Queues/Threads (queue depth/length)
• Size of each read/write

But wait, there is more!
(about storage)

A tale of two primary keys
• One embedded transactional database engine (LMDB)
• 100 transactions, 100 key/value writes per transaction
• Two databases, keys and values have the same size
• One uses sequential keys (UuidCreateSequential)
• One uses random keys (UuidCreate)

[Chart: B-Tree seeks per write TX. X axis: Transaction #; Y axis: # of seeks per TX; series: Random, Sequential]

[Chart: Total seek latency per TX. X axis: Transaction #; Y axis: total seek latency per TX; series: Random, Sequential]

Process Monitor
Types of OS operations to listen for (file activity, network activity, etc.)
Types of operations

Storage algorithms (page-oriented storage engines)
• B-tree, B+ Tree
• Optimized for reads
• Optimized for sequential data

B+ tree keys
Sequential keys Non-sequential keys

The cost of hops in the tree
Disk seek
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Sequential reads

Minimize performance impact of keys
• Sequential keys allow better performance
• 1,2,3,4,5
• Users/1, Users/2, Users/3
• Notice: B-Trees are used to store data AND indexes
• Query performance!
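The effect of key order on a page-oriented tree can be approximated with a rough simulation. This is only a sketch under stated assumptions: it is not LMDB, the leaf level is modeled as one sorted list cut into pages of `PAGE_CAPACITY` keys (a made-up number), and "pages touched" stands in for the seeks shown on the earlier slides.

```python
import bisect
import itertools
import random

PAGE_CAPACITY = 50      # assumption: keys per leaf page
WRITES_PER_TX = 100     # same batch size as the experiment on the slides
TRANSACTIONS = 16


def pages_touched_per_tx(next_key):
    """Insert keys in batches and count the distinct leaf pages each batch touches."""
    keys = []                 # all keys, kept sorted like the leaf level of a B+ tree
    touched_per_tx = []
    for _ in range(TRANSACTIONS):
        touched = set()
        for _ in range(WRITES_PER_TX):
            key = next_key()
            position = bisect.bisect_left(keys, key)
            touched.add(position // PAGE_CAPACITY)   # leaf page that receives the key
            keys.insert(position, key)
        touched_per_tx.append(len(touched))
    return touched_per_tx


def sequential_keys():
    counter = itertools.count()
    return lambda: next(counter)


def random_keys(seed=42):
    rng = random.Random(seed)
    return lambda: rng.getrandbits(64)


sequential = pages_touched_per_tx(sequential_keys())
scattered = pages_touched_per_tx(random_keys())
print("sequential keys, pages touched per TX:", sequential)
print("random keys, pages touched per TX:    ", scattered)
```

In this model sequential keys always land on the last pages, so the count stays flat, while random keys spread over more and more pages as the tree grows.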

There is another kind of storage
implementation…

The tale of an occasionally slow database
• Sometimes the Cassandra database was fast, and sometimes it was not
• This happened non-deterministically
There is a page about it in the Cassandra documentation!
https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/slowReads.html

Storage algorithms (log-structured storage engines)
[Diagram: an in-memory storage (usually a B-Tree or Skip List) next to three SSTables holding (Users/1, Users/3), (Users/3, Users/5) and (Users/22, Users/3)]

Insert users/44
[Diagram: Users/44 is added to the in-memory storage; the SSTables are untouched]

Update users/3
[Diagram: a Users/3 (update) entry joins Users/44 in the in-memory storage; the older Users/3 entries stay in the SSTables]

Delete users/5
[Diagram: a Users/5 (delete) entry is added to the in-memory storage; the old Users/5 entry stays in its SSTable]

Flush!
[Diagram: the in-memory entries Users/44, Users/3 (update) and Users/5 (delete) are written out]

(appended at the end)
[Diagram: the flushed entries Users/44, Users/3 (update) and Users/5 (delete) form a new SSTable after the existing ones]

Reading requires searching ALL SSTables!
[Diagram: the same SSTables, with Users/3 present in every one of them]
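The insert, update, delete and flush steps on these slides can be sketched as a toy log-structured store. Everything here is hypothetical (`TinyLsm`, the `TOMBSTONE` marker, the `v1`..`v4` values); it mirrors the slides' data rather than the internals of Cassandra or any other engine, and it returns how many SSTables a read had to search.

```python
TOMBSTONE = object()  # assumption: a unique marker stands in for "deleted"


class TinyLsm:
    """Hypothetical toy model of a log-structured store; not any real engine."""

    def __init__(self):
        self.memtable = {}   # in-memory storage (real engines keep it sorted)
        self.sstables = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        self.memtable[key] = TOMBSTONE   # a delete is just another write

    def flush(self):
        """Write the in-memory entries out as a new SSTable appended at the end."""
        if self.memtable:
            run = sorted(self.memtable.items(), key=lambda entry: entry[0])
            self.sstables.append(run)
            self.memtable = {}

    def get(self, key):
        """Return (value, number_of_sstables_searched); value is None if absent."""
        if key in self.memtable:
            return self._visible(self.memtable[key]), 0
        searched = 0
        for run in reversed(self.sstables):      # newest SSTable first
            searched += 1
            for run_key, value in run:           # linear scan keeps the sketch short
                if run_key == key:
                    return self._visible(value), searched
        return None, searched

    @staticmethod
    def _visible(value):
        return None if value is TOMBSTONE else value


db = TinyLsm()
for batch in ([("users/1", "v1"), ("users/3", "v1")],
              [("users/3", "v2"), ("users/5", "v1")],
              [("users/22", "v1"), ("users/3", "v3")]):
    for key, value in batch:
        db.put(key, value)
    db.flush()

db.put("users/44", "v1")      # insert
db.put("users/3", "v4")       # update
db.delete("users/5")          # delete
db.flush()                    # flush: a fourth SSTable is appended

print(db.get("users/3"))      # newest version, found in the newest SSTable
print(db.get("users/5"))      # hidden by the delete marker
print(db.get("users/1"))      # only in the oldest SSTable, so all four are searched
```

A key that lives only in the oldest SSTable, or does not exist at all, forces the read through every SSTable, which is one plausible source of the "sometimes fast, sometimes slow" behavior.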

Storage algorithms – LSM Tree compaction
This is roughly an O(n*log(n)) operation!
[Diagram: three sorted runs written by TX1, TX2 and TX3, with keys (1, 3, 5), (3, 9, 12) and (5, 12, 18), are merged into a single sorted run with keys 1, 3, 5, 9, 12, 18]

Storage algorithms – LSM Tree compaction
[Diagram: after compaction, next to the in-memory storage, the SSTables are merged into one holding Users/1, Users/3, Users/22, Users/44]
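Compaction can be sketched as a merge of sorted runs that keeps only the newest version of each key and drops deleted keys. This is a minimal illustration, not Cassandra's implementation: the `compact` function, the use of `None` as a delete marker and the integer user IDs are all assumptions made for the example.

```python
import heapq

TOMBSTONE = None  # assumption: None marks a deleted key


def compact(sstables):
    """Merge sorted runs (oldest first) into one run with the newest value per key."""
    # Tag every entry with the negated age of its run, so that for equal keys
    # the entry from the newest run comes out of the merge first.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(sstables)
    )
    merged = []
    previous_key = object()   # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged):
        if key == previous_key:
            continue          # older version of a key that was already handled
        previous_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


sstables = [
    [(1, "v1"), (3, "v1")],
    [(3, "v2"), (5, "v1")],
    [(3, "v3"), (22, "v1")],
    [(3, "v4"), (5, TOMBSTONE), (44, "v1")],
]
compacted = compact(sstables)
print(compacted)
```

With this input the result holds users 1, 3, 22 and 44: the three stale copies of user 3 and the deleted user 5 are gone, so a later read has a single run to search.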

Compaction strategies
• Compaction strategies: WHEN is compaction triggered?
• Leveled: optimized for reads
• SizeTiered: optimized for inserts
• Time-window: optimized for TimeSeries/immutable data
CQL

Now, let’s talk about another
important feature…

ACID guarantees!
• Atomicity, Consistency, Isolation, Durability
• Note: not all DBs support them
  • All RDBMSs
  • Some NoSQL databases: RavenDB, LevelDB, LMDB

Write-ahead log (WAL) - Atomicity, Durability
WAL: Put "A", Put "B", Commit, Put "X", Commit
[Diagram: writes are appended to the WAL first; a later flush moves them into the data storage]

Write-ahead log (WAL)
• "Write-Through" writes – no caching (otherwise no durability!)
• Lots of small writes (overhead of each write)
[Diagram: a write-through write goes straight to the disk, while a regular write lands in a volatile buffer and reaches the disk only on a periodic flush]
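A write-ahead log can be sketched in a few lines. This is a simplified illustration, not the log format of any real database: `TinyWal`, `recover` and the JSON-lines record layout are assumptions, and `os.fsync` is used here as a stand-in for write-through I/O.

```python
import json
import os
import tempfile


class TinyWal:
    """Hypothetical append-only log; every commit is forced to the device."""

    def __init__(self, path):
        self.log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})

    def commit(self):
        self._append({"op": "commit"})
        self.log.flush()                 # push Python's buffer to the OS
        os.fsync(self.log.fileno())      # ask the OS to push its cache to the device

    def close(self):
        self.log.close()

    def _append(self, record):
        self.log.write(json.dumps(record) + "\n")


def recover(path):
    """Replay the log, applying only writes that were followed by a commit."""
    data, pending = {}, {}
    with open(path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record["op"] == "put":
                pending[record["key"]] = record["value"]
            else:                        # commit: the pending writes become visible
                data.update(pending)
                pending = {}
    return data                          # uncommitted writes are dropped (atomicity)


with tempfile.TemporaryDirectory() as folder:
    wal_path = os.path.join(folder, "wal.log")
    wal = TinyWal(wal_path)
    wal.put("A", 1)
    wal.put("B", 2)
    wal.commit()
    wal.put("X", 3)                      # simulated crash: this commit never happens
    wal.close()
    recovered = recover(wal_path)

print(recovered)
```

Each commit pays for a forced write, which is why a WAL produces many small synchronous writes.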

ATTO Disk Benchmark
Benchmarking modes

Buffered vs. Write-through
With buffers and caching (GBs/sec)
No caching, Write-Through (MBs/sec)
NVMe SSD (Samsung 860 EVO m.2)
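The gap between buffered and write-through I/O can be reproduced on a small scale. The sketch below is not a replacement for a real benchmark tool: the block size and block count are arbitrary assumptions, and calling `os.fsync` after every write only approximates write-through behavior.

```python
import os
import tempfile
import time


def write_throughput_mb_s(path, blocks, block_size, sync_each_write):
    """Write `blocks` blocks of `block_size` bytes and return the rate in MB/s."""
    payload = b"x" * block_size
    start = time.perf_counter()
    with open(path, "wb") as output:
        for _ in range(blocks):
            output.write(payload)
            if sync_each_write:
                output.flush()               # push Python's buffer to the OS
                os.fsync(output.fileno())    # wait until the device has the data
    elapsed = time.perf_counter() - start
    return blocks * block_size / elapsed / 1_000_000


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "benchmark.bin")
    buffered = write_throughput_mb_s(target, 64, 4096, sync_each_write=False)
    synced = write_throughput_mb_s(target, 64, 4096, sync_each_write=True)

print(f"buffered:       {buffered:.1f} MB/s")
print(f"fsync per write: {synced:.1f} MB/s")
```

The absolute numbers depend entirely on the machine and the file system, so only the ratio between the two lines is of interest.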

Let's talk a bit about indexing and querying
Storage algorithms & Storage
Indexing & Queries
Network

Let’s start from something… simple.

First, we define an index
We create an index that covers the City and Country fields of ShipTo

Then we do some queries
Fetching orders that were shipped to Paris or Lyon

And another query
Fetching orders that were shipped to all of France

Why?
Field 1   Field 2
Lyon      France
Paris     France
Oslo      Norway
In some databases (like MongoDB)
• Indexed fields are concatenated into a single index key
• Filtering works only by key prefix
Index Key     Record IDs
LyonFrance    [7],[1],[4]
ParisFrance   [5],[6]
OsloNorway    [12],[2],[9],[34]
The values are concatenated!

Why?
Field 1   Field 2
Lyon      France
Paris     France
Oslo      Norway
In some databases (RavenDB, any Lucene-based index)
• Indexed terms are stored separately
• Filtering by one or both fields in any order (union/intersect as needed)
ShipTo.City Index
Lyon     [7],[1],[4]
Paris    [5],[6]
Oslo     [12],[2],[9],[34]
ShipTo.Country Index
France   [7],[1],[4],[5],[6]
Norway   [12],[2],[9],[34]
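The difference between the two index layouts can be illustrated with plain Python collections. This is a hypothetical model built from the record IDs on the slides; it reflects neither MongoDB's nor Lucene's internal structures, and all function names are made up for the example.

```python
import bisect
from collections import defaultdict

# Record ID -> (city, country), taken from the tables on the slides.
orders = {
    7: ("Lyon", "France"), 1: ("Lyon", "France"), 4: ("Lyon", "France"),
    5: ("Paris", "France"), 6: ("Paris", "France"),
    12: ("Oslo", "Norway"), 2: ("Oslo", "Norway"),
    9: ("Oslo", "Norway"), 34: ("Oslo", "Norway"),
}

# Layout 1: one compound key per record, sorted by (city, country).
compound = sorted((key, order_id) for order_id, key in orders.items())
compound_keys = [key for key, _ in compound]


def by_city_prefix(city):
    """City is the key prefix, so a binary search finds the matching range."""
    ids = []
    i = bisect.bisect_left(compound_keys, (city, ""))
    while i < len(compound) and compound_keys[i][0] == city:
        ids.append(compound[i][1])
        i += 1
    return sorted(ids)


def by_country_scan(country):
    """Country is not a prefix, so every index entry has to be inspected."""
    return sorted(order_id for (_, c), order_id in compound if c == country)


# Layout 2: separate term indexes, one set of record IDs per term.
city_index = defaultdict(set)
country_index = defaultdict(set)
for order_id, (city, country) in orders.items():
    city_index[city].add(order_id)
    country_index[country].add(order_id)

paris_or_lyon = sorted(city_index["Paris"] | city_index["Lyon"])      # union
all_of_france = sorted(country_index["France"])                       # direct lookup
lyon_in_france = sorted(city_index["Lyon"] & country_index["France"]) # intersection

print(by_city_prefix("Lyon"), by_country_scan("France"))
print(paris_or_lyon, all_of_france, lyon_in_france)
```

Both layouts return the same records; what differs is whether a filter on the second field alone can use the index or degrades into a scan.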

Collection/table scans are easily overlooked
Collection scan - development
• Small amount of data
• Extremely small query latency
Collection scan - production
• Large amount of data
• HUGE latency (quite often!)
Latency: 50ms vs 50 hours

Indexing
• Indexes are stored as trees (usually B-trees)
• Updates have non-trivial complexity!
https://commons.wikimedia.org/wiki/File:Trie_example.svg

Indexing
Search time complexity (WHERE clause):
O(log(N)) + O(log(M)) + O(Max(K,P))
Where:
• N and M are the numbers of rows in the two indexes
• K and P are the sizes of the result sets of the two index searches
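A quick worked example shows which term dominates. The numbers are assumptions chosen for illustration (two indexes of a million rows each, result sets of 2000 and 500 rows), and the bounds are treated as rough step counts with constant factors ignored:

```latex
\begin{align*}
\log_2 N + \log_2 M + \max(K, P)
  &\approx \log_2 10^6 + \log_2 10^6 + \max(2000, 500) \\
  &\approx 20 + 20 + 2000 = 2040
\end{align*}
```

Under these assumptions the two index lookups are nearly free, and the cost is driven by the size of the larger result set.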

And if we use RDBMS, things become even
more interesting...
Join Algorithm   Complexity
Merge Join       O(n*log(n) + m*log(m))
Hash Join        O(n + m)
Index Join       O(m*log(n))
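Two of the join algorithms from the table can be sketched over lists of dictionaries. These are textbook versions written for illustration, not what any particular RDBMS executes; the `customers` and `orders` rows are invented sample data.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """O(n + m): build a hash table on one side, then probe it with the other."""
    buckets = defaultdict(list)
    for row in left:
        buckets[row[key]].append(row)
    return [{**l, **r} for r in right for l in buckets.get(r[key], [])]


def merge_join(left, right, key):
    """O(n*log(n) + m*log(m)): sort both sides, then walk them in lockstep."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the block of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            while i < len(left) and left[i][key] == left_key:
                result.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return result


customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bo"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 10, "customer_id": 2},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]

hashed = hash_join(customers, orders, "customer_id")
merged = merge_join(customers, orders, "customer_id")
print(hashed)
print(merged)
```

Both functions return the same three joined rows, only in a different order; the sorting step is what makes the merge join more expensive when the inputs are not already ordered.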

And if we have a non-trivial query...
https://dev.to/tyzia/example-of-complex-sql-query-to-get-as-much-data-as-possible-from-database-9he
Those are 10 JOIN statements!

…we have complexity between
O(log(n)) and O(too much)!
More often than not it is O(too much)…

What can (should!) we do?
• RDBMS
  • Proper indexing (kinda obvious, but still…)
  • Optimize (remove unnecessary JOINs; depends on business logic)
  • Reduce query complexity
  • Replace ‘row by row’ cursors with set-based queries
  • Reduce the amount of work queries do (for example, unnecessary sub-queries)
  • Remove ORDER BY where it makes sense (huge overhead)
  • Other optimizations are possible
• NoSQL
  • Proper modeling
  • Well-planned indexing

Let’s talk a bit about NoSQL Data Modeling
• Documents are independent
• Transaction boundaries
• Depend on context!
• Depend on query pattern
• Depend on data growth

Networking can be a bottleneck too!
Storage algorithms & Storage
Indexing & Queries
Network

Here is a riddle: why did a query with 100 results
consistently take several seconds to complete?
Hint: the request spends < 10ms on the server

The investigation
1. Look at query latency on the server

The investigation
2. Look at Fiddler timings
[Screenshot: Fiddler timings; highlighted columns: Response Size, Latency]

Network bandwidth is not infinite!
[Diagram: query results made of 3 MB, 2 MB and 5 MB documents travel from the database to the client API]

Solution: server-side projections (NoSQL)
Server-side Projection
RQL

Solution: server-side projections (NoSQL)
Server-side Projection
MongoDB API
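The payload saving from a projection can be estimated without a database. The sketch below uses neither the RavenDB nor the MongoDB client API: the document shape, the field names and the scaled-down 30 KB `Attachment` field are assumptions that stand in for the multi-megabyte documents in the riddle.

```python
import json


def make_document(doc_id):
    """Hypothetical order document with one large field the query does not need."""
    return {
        "Id": f"orders/{doc_id}",
        "Company": f"companies/{doc_id % 10}",
        "ShipTo": {"City": "Lyon", "Country": "France"},
        "Attachment": "x" * 30_000,   # assumption: scaled-down stand-in for MBs of data
    }


def project(document, fields):
    """Keep only the requested fields, as a server-side projection would."""
    return {field: document[field] for field in fields}


documents = [make_document(i) for i in range(100)]
projected = [project(doc, ("Id", "ShipTo")) for doc in documents]

full_bytes = len(json.dumps(documents).encode("utf-8"))
projected_bytes = len(json.dumps(projected).encode("utf-8"))

print(f"full documents: {full_bytes:,} bytes")
print(f"projected:      {projected_bytes:,} bytes")
```

The query still returns 100 results in both cases; the projection only changes how many bytes have to cross the network.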

Also… database requests can be an
interesting issue…

Network overhead
https://github.com/dajuric/simple-http
TCP handshake

Network overhead
Round Trip Time (RTT)
• Physical distance (insignificant for LANs)
• Bandwidth
• Network hops
Round-trip Time

What can we do?
• Refactor to reduce number of requests (kinda obvious, but still…)
• NHibernate – Future Queries
• Entity Framework - QueryFuture
• RavenDB – Lazy Queries
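The saving from sending several queries in one round trip can be estimated with simple arithmetic. This is a back-of-the-envelope model, not a measurement: the 20 ms round-trip time, the 1 ms of server work per query and the count of 50 queries are assumptions.

```python
def separate_requests_ms(queries, rtt_ms, server_ms_per_query):
    """Every query pays its own round trip."""
    return queries * (rtt_ms + server_ms_per_query)


def batched_request_ms(queries, rtt_ms, server_ms_per_query):
    """All queries travel together, so the round trip is paid once."""
    return rtt_ms + queries * server_ms_per_query


separate_ms = separate_requests_ms(50, rtt_ms=20, server_ms_per_query=1)
batched_ms = batched_request_ms(50, rtt_ms=20, server_ms_per_query=1)

print(f"50 separate requests: {separate_ms} ms")
print(f"1 batched request:    {batched_ms} ms")
```

With these assumed numbers almost all of the time in the unbatched case is spent waiting on the network rather than on the database.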

May sound trivial, but…
Do take a look at database traffic while stress testing and if possible in
production too.
• Fiddler
• Wireshark
• Profilers
• Any other tool to inspect traffic

To sum it up
• Databases are abstractions
• Abstractions are leaky and might be the cause of perf issues
• Such perf issues can be dealt with (if we know about the "leak"!)

Questions?
michael.yarichuk@hibernatingrhinos.com
@myarichuk
https://github.com/ravendb/ravendb
https://github.com/myarichuk/PerfDemo-Sequential-vs-Random-Key
