Why databases cry at night
Michael Yarichuk
Hibernating Rhinos
Magic?
This Photo by Unknown Author is licensed under CC BY-NC-ND
This Photo by Unknown Author is licensed under CC BY-SA
Query
Nope. Not magic!
Databases are abstractions
The photo is licensed under CC BY-SA.
The Law of Leaky Abstractions
"All non-trivial abstractions, to some degree, are leaky."
- Joel Spolsky
Many shades of grey databases
RDBMS: MS-SQL, MySQL, Oracle, PostgreSQL
Key/Value Store: LMDB, BerkeleyDB, Cassandra*, Dynamo
Document Database: RavenDB, MongoDB, CouchDB, CosmosDB
Graph Database: Neo4j, OrientDB*, ArangoDB, JanusGraph
Very different databases have the same reasons for crying.
Things we will take a look at
Storage algorithms & Storage
Indexing & Queries
Network
Let’s start with something simple:
Storage.
It just works, no?
Famous last words!
Before we discuss storage, here is a riddle…
RavenDB server-wide backup failed
• A single RavenDB instance hosted multiple databases
• Plenty of memory and cores, resource usage was low
• Nothing else was running on the machine EXCEPT RavenDB
• Scheduled backup tasks failed soon after they started
The backup tasks started at the same time!
Also, there were a gazillion databases
[Diagram: Database A through Database FF, all on the same SAN storage]
RavenDB’s failing backups
Approx. 200 databases doing backups at the same time WILL cause storage saturation!
The solution was rather simple: run the backups one after another, so the storage is not overloaded
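According to the speaker's notes, the fix for the failing backups was to make them run one after another. A minimal Python sketch of that idea follows. It is not RavenDB's scheduler: `run_backup`, `run_all_backups`, `backup_peak_concurrency` and the `max_concurrent` parameter are hypothetical names, and the short sleep stands in for real backup I/O.

```python
import asyncio

async def run_backup(name, limiter, stats):
    """Hypothetical backup task; the sleep stands in for real backup I/O."""
    async with limiter:                      # wait for a free backup slot
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.001)           # placeholder for writing the backup
        stats["running"] -= 1
    return name

async def run_all_backups(databases, max_concurrent=1):
    """Schedule every backup at once, but let only `max_concurrent` hit storage."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"running": 0, "peak": 0}
    done = await asyncio.gather(
        *(run_backup(db, limiter, stats) for db in databases)
    )
    return done, stats["peak"]

def backup_peak_concurrency(database_count, max_concurrent=1):
    """Return the highest number of backups that ran at the same moment."""
    databases = [f"Database {i}" for i in range(database_count)]
    _, peak = asyncio.run(run_all_backups(databases, max_concurrent))
    return peak
```

With `max_concurrent=1` the tasks can still be scheduled at the same time, yet only one of them touches the storage at any moment.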
Disk queue length can be an… issue
[Diagram: Disk Queue Length. Writes of 16 KB, 8 KB, 10 KB, 16 KB, 12 KB and 12 KB wait in line for the Disk Write]
Disk queue depth
[Diagram: Disk Queue Depth = 2. Thread 1 and Thread 2 each have their own line of 8 KB to 16 KB writes feeding the Disk Write]
What can we do about storage issues?
• Load test database code to ensure
  • Write-through throughput
  • Enough IOPS for expected production load (disk queue length is <= 2)
• Cloud → provision IOPS
• Load-test application to find limits of the system
• Monitoring! (too long queues = storage bottleneck)
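The link between IOPS, latency and queue length in the list above can be approximated with Little's law (average outstanding requests = arrival rate × time each request spends in the system). The sketch below is plain arithmetic, not a measurement tool; the function names and the numbers used with it are illustrative assumptions.

```python
def average_queue_length(iops, avg_latency_ms):
    """Little's law: outstanding I/O requests = arrival rate * time in system."""
    return iops * (avg_latency_ms / 1000.0)

def max_iops_for_queue(target_queue_length, avg_latency_ms):
    """Highest request rate that keeps the average queue at or below the target."""
    return target_queue_length / (avg_latency_ms / 1000.0)
```

For example, at an assumed 5 ms per I/O, a sustained 400 IOPS already corresponds to an average queue length of about 2; anything beyond that rate makes the queue grow.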
Storage performance benchmarks
• Sysinternals Process Monitor
• CrystalDiskMark
• ATTO Disk Benchmark
• (Many) other tools
CrystalDiskMark
• Random/sequential I/O?
• Queues/Threads (queue depth/length)
• Size of each read/write
But wait, there is more!
(about storage)
A tale of two primary keys
• One embedded transactional database engine (LMDB)
• 100 transactions, 100 key/value writes per transaction
• Two databases, keys and values have the same size
• One uses sequential keys (UuidCreateSequential)
• One uses random keys (UuidCreate)
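The effect measured in this experiment can be imitated with a toy model. The sketch below is not LMDB and not the linked PerfDemo code: it keeps all keys in one sorted list, pretends every `PAGE_SIZE` consecutive keys share a leaf page (an arbitrary figure), and counts how many distinct pages each transaction touches. Seeded random integers stand in for random UUIDs, a counter for sequential ones.

```python
import bisect
import random

PAGE_SIZE = 50  # keys per leaf page; an arbitrary figure for this toy model

def pages_touched_per_tx(key_source, transactions=16, writes_per_tx=100):
    """Insert keys into one sorted list; count distinct leaf pages hit per TX."""
    keys = []
    touched = []
    for _ in range(transactions):
        pages = set()
        for _ in range(writes_per_tx):
            key = next(key_source)
            position = bisect.bisect_left(keys, key)
            pages.add(position // PAGE_SIZE)   # the page this key lands on
            keys.insert(position, key)
        touched.append(len(pages))
    return touched

def sequential_keys():
    """Monotonically increasing keys (stand-in for sequential UUIDs)."""
    n = 0
    while True:
        n += 1
        yield n

def random_keys(seed=42):
    """Uniformly random 64-bit keys (stand-in for random UUIDs)."""
    rng = random.Random(seed)
    while True:
        yield rng.getrandbits(64)

sequential = pages_touched_per_tx(sequential_keys())
scattered = pages_touched_per_tx(random_keys())
```

In this model sequential keys always land on the last pages, so every transaction touches the same small number of pages, while random keys spread over more and more pages as the data grows.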
A tale of two primary keys
[Chart: B-Tree seeks per write TX. X axis: Transaction # (0 to 16); Y axis: # of seeks per TX (0 to 180); series: Random, Sequential]
A tale of two primary keys
[Chart: Total seek latency per TX. X axis: Transaction # (0 to 16); Y axis: total seek latency per TX (0 to 0.05); series: Random, Sequential]
Process Monitor
Types of OS operations to listen to (file activity, network activity, etc.)
Types of operations
Why?
Storage algorithms (page-oriented storage engines)
• B-tree, B+ Tree
• Optimized for reads
• Optimized for sequential data
B+ tree keys
Sequential keys Non-sequential keys
The cost of hops in the tree
Disk seek
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Sequential reads
Minimize performance impact of keys
• Sequential keys allow better performance
• 1,2,3,4,5
• Users/1, Users/2, Users/3
• Notice: B-Trees are used to store data AND indexes
• Query performance!
There is another kind of storage implementation…
The tale of an occasionally slow database
• Sometimes the Cassandra database was fast, and sometimes it was not
• This happened non-deterministically
A page in Cassandra documentation!
https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/slowReads.html
Storage algorithms (log-structured storage engines)
[Diagram: In-memory Storage (usually a B-Tree or Skip List) next to three SSTables holding Users/1, Users/3, Users/5 and Users/22; Users/3 appears three times]
• Insert users/44: Users/44 goes into the in-memory storage
• Update users/3: Users/3 (update) goes into the in-memory storage
• Delete users/5: Users/5 (delete) goes into the in-memory storage
• Flush! Users/44, Users/3 (update) and Users/5 (delete) are appended at the end
• Reading requires searching ALL SSTables!
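The insert, update, delete, flush and read steps of a log-structured engine can be sketched in a few lines. This is a toy under stated assumptions, not Cassandra's implementation: `TinyLsm` and `TOMBSTONE` are hypothetical names, a plain dict plays the in-memory storage, and each flushed dict plays an SSTable. `get` also reports how many tables it had to check.

```python
TOMBSTONE = object()  # marker written in place of a deleted value

class TinyLsm:
    def __init__(self):
        self.memtable = {}   # in-memory storage (real engines use sorted structures)
        self.sstables = []   # flushed, never modified again; oldest first

    def put(self, key, value):
        self.memtable[key] = value          # inserts and updates look the same

    def delete(self, key):
        self.memtable[key] = TOMBSTONE      # a delete is just another write

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Search newest to oldest; return (value or None, tables checked)."""
        tables_checked = 0
        for table in [self.memtable] + self.sstables[::-1]:
            tables_checked += 1
            if key in table:
                value = table[key]
                return (None if value is TOMBSTONE else value), tables_checked
        return None, tables_checked

db = TinyLsm()
db.put("users/1", "a"); db.put("users/3", "a"); db.flush()
db.put("users/3", "b"); db.put("users/5", "b"); db.flush()
db.put("users/22", "c"); db.put("users/3", "c"); db.flush()
db.put("users/44", "d"); db.put("users/3", "d"); db.delete("users/5"); db.flush()
```

A recently written key is found quickly, while an old key or a missing key forces a walk through every table. Real engines usually add per-table filters or indexes to skip tables, which this sketch leaves out.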
Storage algorithms – LSM Tree compaction
This is roughly an O(n*log(n)) operation!
[Diagram: three sorted runs are merged into one]
TX1: Key/Value 1, 3, 5
TX2: Key/Value 3, 9, 12
TX3: Key/Value 5, 12, 18
Merged: Key/Value 1, 3, 5, 9, 12, 18
Storage algorithms – LSM Tree compaction
In-memory Storage
Users/1
Users/3
Users/22
Users/44
After compaction
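Compaction as pictured here is a merge of sorted runs in which the newest version of each key wins and deleted keys disappear. A minimal sketch, assuming each SSTable is a key-sorted list of `(key, value)` pairs and that a value of `None` marks a delete; `compact` is a hypothetical name, and no real engine's file format is implied.

```python
import heapq

TOMBSTONE = None  # in this sketch a value of None marks a deleted key

def compact(sstables):
    """Merge key-sorted SSTables (oldest first), keeping the newest value per key."""
    # Tag each entry with the negated age of its table, so that for equal keys
    # the entry from the newest table sorts first.
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    last_key = object()                 # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams):
        if key == last_key:
            continue                    # older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

tx1 = [(1, "a"), (3, "a"), (5, "a")]
tx2 = [(3, "b"), (9, "b"), (12, "b")]
tx3 = [(5, "c"), (12, "c"), (18, "c")]
compacted = compact([tx1, tx2, tx3])
```

The three input runs reuse the key numbers from the slide; the letters are made-up values that show which run each surviving entry came from.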
Compaction strategies
• Compaction strategies → WHEN is compaction triggered?
• Leveled → optimization for reads
• SizeTiered → optimization for inserts
• Time-window → optimization for TimeSeries/immutable data
CQL
Now, let’s talk about another important feature…
ACID guarantees!
• Atomicity, Consistency, Isolation, Durability
• Note: not all DBs support them
  • All RDBMSs
  • Some NoSQLs – RavenDB, LevelDB, LMDB
Write-ahead log (WAL) - Atomicity, Durability
[Diagram: the WAL receives Put "A", Put "B", Commit, Put "X", Commit (Writing); the WAL is later flushed to Data Storage (Flush)]
Write-ahead log (WAL)
• "Write-Through" writes – no caching (otherwise no durability!)
• Lots of small writes (overhead of each write)
[Diagram: a Write-Through write skips the Volatile Buffer; a Regular Write lands in the Volatile Buffer and reaches the disk on a Periodic Flush]
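How a write-ahead log gives atomicity and durability can be sketched as an append-only file that is forced to disk on every commit and replayed on startup. This is a toy, not any database's log format: `TinyWal`, `replay` and `demo` are hypothetical names, and `os.fsync` is used here as an approximation of write-through behaviour.

```python
import json
import os
import tempfile

class TinyWal:
    def __init__(self, path):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def _append(self, record):
        self.file.write(json.dumps(record) + "\n")

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})

    def commit(self):
        self._append({"op": "commit"})
        self.file.flush()
        os.fsync(self.file.fileno())   # force the log to disk before acknowledging

    def close(self):
        self.file.close()

def replay(path):
    """Rebuild state from the log, applying only writes followed by a commit."""
    state, pending = {}, {}
    with open(path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record["op"] == "put":
                pending[record["key"]] = record["value"]
            else:                       # commit: pending writes become visible
                state.update(pending)
                pending = {}
    return state

def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = TinyWal(path)
    wal.put("A", 1); wal.put("B", 2); wal.commit()
    wal.put("X", 3)                     # written but never committed
    wal.close()
    return replay(path)
```

In `demo` the write of "X" is deliberately left uncommitted (unlike on the slide) to show that replay drops it. The per-commit `fsync` is also where the "lots of small writes" overhead comes from.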
ATTO Disk Benchmark
Benchmarking modes
Buffered vs. Write-through
With buffers and caching: GBs/sec
No caching, Write-Through: MBs/sec
NVMe SSD (Samsung 860 EVO m.2)
Let's talk a bit about indexing and querying
Storage algorithms & Storage
Indexing & Queries
Network
Are those queries different?
Let’s start from something… simple.
First, we define an index
We create an index that covers city and country fields of ShipTo
Then we do some queries
Fetching orders that were shipped to Paris or Lyon
So far so good…
And another query
Fetching orders that were shipped to all of France
It’s a gotcha!
Why?
Field 1 Field 2
Lyon France
Paris France
Oslo Norway
In some databases (like MongoDB)
• Indexed fields are concatenated into single index key
• Filtering only by prefix
Index Key Record IDs
LyonFrance [7],[1],[4]
ParisFrance [5],[6]
OsloNorway [12],[2],[9],[34]
The values are concatenated!
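The prefix-only behaviour of a concatenated index key can be shown with a sorted list and binary search. This is a minimal sketch using the slide's data, not MongoDB's index code: the function names are hypothetical, and whether a real database scans the whole index in the second case depends on the engine and its query planner.

```python
import bisect

# Compound index on (city, country): one sorted list of combined keys.
compound_index = sorted([
    (("Lyon", "France"), [7, 1, 4]),
    (("Oslo", "Norway"), [12, 2, 9, 34]),
    (("Paris", "France"), [5, 6]),
])
index_keys = [key for key, _ in compound_index]

def find_by_city(city):
    """City is the key prefix: binary search jumps straight to the matches."""
    start = bisect.bisect_left(index_keys, (city,))
    ids, examined = [], 0
    for key, record_ids in compound_index[start:]:
        if key[0] != city:
            break
        examined += 1
        ids.extend(record_ids)
    return ids, examined

def find_by_country(country):
    """Country is not a prefix: every index entry has to be examined."""
    ids, examined = [], 0
    for key, record_ids in compound_index:
        examined += 1
        if key[1] == country:
            ids.extend(record_ids)
    return ids, examined
```

Both functions return the matching record IDs together with the number of index entries they examined. With three entries the difference is tiny; with millions of entries it is the difference between a seek and a scan.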
Why?
Field 1 Field 2
Lyon France
Paris France
Oslo Norway
In some databases (RavenDB, any Lucene-based index)
• Indexed terms are stored separately
• Filtering by one or both fields in any order (union/intersect as needed)
ShipTo.City Index
Index Key Record IDs
Lyon [7],[1],[4]
Paris [5],[6]
Oslo [12],[2],[9],[34]
ShipTo.Country Index
Index Key Record IDs
France [7],[1],[4],[5],[6]
Norway [12],[2],[9],[34]
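With one index per field, a query combines the per-term record sets with union and intersection. A minimal sketch using the slide's data; the dictionaries and function names are illustrative assumptions and do not mirror Lucene's or RavenDB's internals.

```python
city_index = {"Lyon": {7, 1, 4}, "Paris": {5, 6}, "Oslo": {12, 2, 9, 34}}
country_index = {"France": {7, 1, 4, 5, 6}, "Norway": {12, 2, 9, 34}}

def orders_in_cities(*cities):
    """OR across terms of one field: union of the record sets."""
    result = set()
    for city in cities:
        result |= city_index.get(city, set())
    return result

def orders_in_country(country):
    """A single-term lookup on the second field needs no scan."""
    return set(country_index.get(country, set()))

def orders_in_city_and_country(city, country):
    """AND across fields: intersection of the two record sets."""
    return city_index.get(city, set()) & country_index.get(country, set())
```

Here "shipped to Paris or Lyon" and "shipped to France" are both direct lookups, which is why field order does not matter with this layout.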
Collection/table scans are easily overlooked
Collection scan - development
• Small amount of data
• Extremely small query latency
Collection scan - production
• Large amount of data
• HUGE latency (quite often!)
Latency: 50ms vs 50 hours
Indexing
• Indexes are stored as trees (usually B-trees)
• Updates have non-trivial complexity!
https://commons.wikimedia.org/wiki/File:Trie_example.svg
Indexing
Search time complexity (WHERE clause):
O(log(N)) + O(log(M)) + O(Max(K,P))
Where:
• N and M are the numbers of rows in the two indexes
• K and P are the sizes of the result sets of the two index searches
And if we use RDBMS, things become even more interesting...
Join Algorithm Complexity
Merge Join O(n*log(n) + m*log(m))
Hash Join O(n + m)
Index Join O(m*log(n))
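To make the gap between join strategies concrete, the sketch below counts the work done by a hash join against a naive nested-loop join. The nested loop is not in the table above and serves only as a baseline; the function names, the row layout and the way "work" is counted are assumptions made for illustration.

```python
from collections import defaultdict

def nested_loop_join(left, right, key):
    """Baseline: compare every left row with every right row, O(n * m)."""
    comparisons, rows = 0, []
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                rows.append({**l, **r})
    return rows, comparisons

def hash_join(left, right, key):
    """Build a hash table on one side, probe it with the other, O(n + m)."""
    buckets = defaultdict(list)
    for l in left:
        buckets[l[key]].append(l)
    steps, rows = len(left), []          # one step per row hashed
    for r in right:
        steps += 1                       # one step per probe
        for l in buckets.get(r[key], []):
            rows.append({**l, **r})
    return rows, steps

customers = [{"customer_id": i, "name": f"c{i}"} for i in range(100)]
orders = [{"order_id": j, "customer_id": j % 100} for j in range(1000)]
nl_rows, nl_cost = nested_loop_join(customers, orders, "customer_id")
h_rows, h_cost = hash_join(customers, orders, "customer_id")
```

Both joins produce the same 1000 rows, but the baseline performs 100 × 1000 comparisons while the hash join performs 100 + 1000 steps. Chaining several joins multiplies differences like this.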
And if we have a non-trivial query...
https://dev.to/tyzia/example-of-complex-sql-query-to-get-as-much-data-as-possible-from-database-9he
Those are 10 JOIN statements!
…we have complexity between
O(log(n)) and O(too much)!
More often than not it is O(too much)…
What can (should!) we do?
• RDBMS
  • Proper indexing (kinda obvious, but still…)
  • Optimize (remove unnecessary JOINs – depends on business logic)
  • Reduce query complexity
  • Replace ‘row by row’ cursors with set-based queries
  • Reduce the amount of work queries do (for example, unnecessary sub-queries)
  • Remove ORDER BY where it makes sense (huge overhead)
  • Other optimizations are possible
• NoSQL
  • Proper modeling
  • Well-planned indexing
Let’s talk a bit about NoSQL Data Modeling
• Documents are independent
• Transaction borders
• Depend on context!
• Depend on query pattern
• Depend on data growth
Well planned indexing?
Networking can be a bottleneck too!
Storage algorithms & Storage
Indexing & Queries
Network
Here is a riddle: why did a query with 100 results consistently take several seconds to complete?
Hint: the request spends < 10ms on the server
The investigation
1. Look at query latency on the server
2. Look at Fiddler timings (Response Size, Latency)
Network bandwidth is not infinite!
[Diagram: Query Results made of a 3 MB, a 2 MB and a 5 MB document travel from the Database to the Client API]
Solution: server-side projections (NoSQL)
Server-side Projection
RQL
Solution: server-side projections (NoSQL)
Server-side Projection
MongoDB API
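Why a projection helps can be seen by comparing payload sizes. The sketch below does not call RavenDB or MongoDB; it builds hypothetical order documents with a large embedded field and measures how many bytes would cross the network with and without a projection. All names, fields and sizes are made up for illustration.

```python
import json

def make_order(order_id, attachment_size):
    """Hypothetical document: small queried fields plus a large embedded payload."""
    return {
        "Id": f"orders/{order_id}",
        "Company": f"companies/{order_id % 10}",
        "ShipTo": {"City": "Paris", "Country": "France"},
        "Attachment": "x" * attachment_size,
    }

def payload_bytes(documents):
    """Size of the JSON response that would be sent to the client."""
    return len(json.dumps(documents).encode("utf-8"))

def project(documents, fields):
    """What a server-side projection returns: only the requested fields."""
    return [{field: doc[field] for field in fields} for doc in documents]

documents = [make_order(i, attachment_size=30_000) for i in range(100)]
full_size = payload_bytes(documents)
projected_size = payload_bytes(project(documents, ["Id", "Company"]))
```

With these made-up sizes, 100 full documents weigh about 3 MB while the projected result is a few kilobytes. The saving only materialises when the projection runs on the server, before the data is sent.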
Also… database requests can be an interesting issue…
Network overhead
https://github.com/dajuric/simple-http
TCP handshake
Network overhead
Round Trip Time (RTT)
• Physical distance (insignificant for LANs)
• Bandwidth
• Network hops
Round-trip Time
So, what can we do?
What can we do?
• Refactor to reduce number of requests (kinda obvious, but still…)
• NHibernate – Future Queries
• Entity Framework - QueryFuture
• RavenDB – Lazy Queries
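The payoff from reducing the number of requests is simple arithmetic over the round-trip time. The sketch below is a back-of-the-envelope model rather than a description of how any of the listed libraries batch their queries; the function names and numbers are assumptions.

```python
def total_latency_ms(request_count, rtt_ms, server_time_ms_per_request):
    """Sequential requests: each one pays a full round trip plus server time."""
    return request_count * (rtt_ms + server_time_ms_per_request)

def batched_latency_ms(request_count, rtt_ms, server_time_ms_per_request):
    """All requests sent in one round trip; the server work still adds up."""
    return rtt_ms + request_count * server_time_ms_per_request
```

For example, 50 queries at an assumed 20 ms round trip and 1 ms of server time each take about 1050 ms when issued one by one, and about 70 ms when sent together.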
May sound trivial, but…
Do take a look at database traffic while stress testing and, if possible, in production too.
• Fiddler
• Wireshark
• Profilers
• Any other tool to inspect traffic
To sum it up
• Databases are abstractions
• Abstractions are leaky and might be the cause of perf issues
• Such perf issues can be dealt with (if we know about the "leak"!)
Questions?
michael.yarichuk@hibernatingrhinos.com
@myarichuk
This Photo by Unknown author is licensed under CC BY-SA.
https://github.com/ravendb/ravendb
https://github.com/myarichuk/PerfDemo-Sequential-vs-Random-Key


Editor's Notes

  • #2: Do not forget: say thanks for coming to my talk!
  • #3: I have been working on internals of a NoSQL database for some time now. Mostly I am working on clusters and distributed stuff like replication. Also, I do support calls and in this talk I will talk about various performance issues I have seen
  • #4: … a company called Hibernating Rhinos. We have created profilers for ORMs, but our main business is (click to next slide)
  • #5: …our main business is RavenDB, a NoSQL document database. MENTION that RavenDB is built on .Net Core
  • #6: Many times developers deal with databases by just tweaking settings and blindly playing around.
  • #7: It is very tempting to think of databases as magic boxes You query a database and get answers to complex queries… Things simply… work.
  • #8: …but databases are not magic!
  • #9: Databases are abstractions over complex internals → many things can go wrong under the outermost layer → the API
  • #10: Why can things go wrong? Because ALL abstractions are leaky
  • #11: There are many types of databases. Give concrete examples of each type of database. RDBMS: MySQL, PostgreSQL, MS-SQL, Oracle. Document DBs: CouchDB, MongoDB, RavenDB. GraphDB: Neo4j, OrientDB. Key/value store: LMDB, BerkeleyDB, Cassandra (hybrid between table and key/value).
  • #12: Note that there are hybrid databases such as Cassandra (hybrid column-family store) and OrientDB (hybrid document & graph) (more than 400 database companies)
  • #13: Databases are different, but deep under the hood they use the same algorithms
  • #14: We will take a look at multiple reasons why databases will cry. Note: mention that things we take a look at only scratch the surface of potential issues
  • #15: We will start with storage.
  • #16: Before we jump into details, here is a riddle… Mention this is a real support case. Involve the audience here: ask them to theorize
  • #18: Mention looking at monitoring numbers (more specifically, info about disk perf)
  • #19: * Mention that this is relevant to many types of DBs. Mention it was a SAN timeout: those databases were stored on a SAN disk SHARED with other systems
  • #21: The solution? Make the backups happen one after another (so the storage won’t be overloaded)
  • #23: How many concurrent I/O operations does the storage support?
  • #24: Mention that IOPS is input/output operations per second
  • #25: Load testing of software is understandable, but what about load testing of hardware (storage)? Tools are needed for it…
  • #26: Sequential/Random. Sequential – accessing files sequentially (for things like video streaming). Random – accessing files randomly, which means more seeks. Queues – how many writes each process does concurrently. Threads – how many processes access the disk concurrently
  • #27: And now, let’s take a look at yet another reason why databases can cry at night
  • #28: Tell a bit about LMDB: what is it? (OpenLDAP, some Linux distros such as Ubuntu, Debian and Fedora) Tell about WHAT kind of thing I checked
  • #31: Just a note – this information is gathered using Sysinternals Process Monitor. On Linux there is an equivalent tool, strace. Process Monitor is useful to trace the ways any application accesses storage, including OS calls and latencies (in this context)
  • #33: Not the only kind of trees that are used, but B-Trees are the most prevalent * B-Trees are self-balancing! Used in: CouchDB, RavenDB, MS-SQL, Oracle
  • #34: Non-sequential keys can make a B-Tree much deeper. Here: the same data but with different keys. Conclusion: different data patterns can give DIFFERENT performance on the same engine
  • #36: How many is many?  I am talking about millions of data records here
  • #38: Mention that this is a true story: tell about a user who told such a story at a workshop (London)
  • #39: Log Structured Merge Trees are used in database storage engines. Mention that they are optimized for writes. * Used in Lucene, LevelDB, HBase, SQLite4, Apache Cassandra, Tarantool (mail.ru) * The data is kept in multiple different structures, each optimized for a different storage (memory/disk) * SSTable (each column): an ordered immutable map from keys to values * Data is synchronized between the two structures efficiently, in batches
  • #40-#44: Mention that it is optimized for writes. Log Structured Merge Trees are used in database storage engines. * Used in Lucene, LevelDB, HBase, SQLite4, Apache Cassandra, Tarantool (mail.ru) * The data is kept in multiple different structures, each optimized for a different storage (memory/disk) * Data is synchronized between the two structures efficiently, in batches
  • #45: Reading requires searching ALL SSTables: since we store data mutation operations, we may have an insert followed by a delete operation, where the delete cancels the originally inserted record
  • #46: Compaction of SSTables is an O(n*S*log(S)) operation (n is the number of SSTables, S is the count of key/values in each). Note: * the compaction is usually asynchronous * multi-level compaction can be implemented * But it adds strain to system resources! (Memory, CPUs, Storage)
  • #47: Mention that it is optimized for writes. Log Structured Merge Trees are used in database storage engines. * Used in Lucene, LevelDB, HBase, SQLite4, Apache Cassandra, Tarantool (mail.ru) * The data is kept in multiple different structures, each optimized for a different storage (memory/disk) * Data is synchronized between the two structures efficiently, in batches
  • #48: Usually, different types of compaction strategies are possible. For example: LSM compaction strategies in Cassandra. Mention that this is a solution!
  • #49: Since we are talking about storage-related algorithms, there is another thing to consider!
  • #50: ACID – if a database supports it, it basically means you will lose data only if your server burns down. So, what is there to talk about? Except that not all databases support it…
  • #51: Mention that WAL is used to implement Atomicity and Durability Explain how it works (in general terms)
  • #52: Any database that uses WAL benefits from good no-cache performance
  • #53: We can test for no-cache writes using tools like ATTO Disk Benchmark
  • #54: There is a big difference between a regular write and a “no-cache” write! * Note that the larger the write batch, the more efficient it gets - Mention the metaphor of moving people in cars vs. moving them in buses * Note the discrepancy: for 1MB writes, 7.95 GB/s vs. 313.22 MB/s
  • #56: Need to EXPLICITLY say what kind of query we are doing: Select ALL ORDERS that were sent to Paris or Lyon AND ordered after a *certain* day
  • #57: We want to be able to query Orders by shipping city and country
  • #58: Let’s see an example WHY it is important…
  • #60: We do index scan, so everything is good…
  • #62: This is not so good! Because collection scans have linear complexity!
  • #63-#64: Thus, we can only query by the first field OR by both fields
  • #65: Beware of collection scans!
  • #66: Since the index is TYPICALLY stored as a tree (can't think of a case where it isn't a tree!) - ROUGHLY we would have O(log(n)) complexity
  • #67: Don't forget to mention: 1) searching an index is O(log(n)) EXCEPT for special cases like hash indexes or tries 2) the mentioned complexity DEPENDS on the query plan, of course 3) M + N is the complexity of the intersection between two query results
  • #71: RDBMS optimizations – partial denormalization? Merge tables?
  • #72: Independent: all data required to process a document is stored within the document itself
  • #73: Well, not exactly indexing, but still… MongoDB map/reduce: find how many blogs one author has, with the author's first name and last name. Note that such an operation will do allocations and will take CPU resources. It needs to reflect business logic AND be as efficient as possible
  • #74: Explain about RavenDB indexes that are defined via LINQ. The same thing applies to RavenDB indexes: such an index is flexible, but it will do allocations and will take CPU resources
  • #78: Discrepancy over request/response latency and time spent on server + total response size (3MB response took 3 seconds to download)
  • #80: The solution was to specify a server-side projection, so ONLY the needed information is sent over the wire
  • #81: Return only name and cuisine fields
  • #86: Ask audience who knows about query futures
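
The notes on the Log Structured Merge Tree slides describe deletes stored as mutation records, reads that may have to look at every SSTable, and compaction that merges SSTables. The toy sketch below illustrates those three ideas under loud assumptions: `TinyLSM`, `TOMBSTONE` and `memtable_limit` are hypothetical names, plain dictionaries stand in for on-disk SSTables, and the read path stops at the newest entry for a key (a common refinement; a lookup for a missing key still visits every table). It is not how any particular engine mentioned in the talk is implemented.

```python
from typing import Dict, List, Optional

TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


class TinyLSM:
    """Toy LSM tree: one in-memory table plus a list of immutable sorted tables."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, object] = {}
        self.sstables: List[Dict[str, object]] = []  # oldest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: object) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key: str) -> None:
        # A delete is just another write: it records a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self) -> None:
        # Freeze the memtable as a key-sorted table that is never modified again.
        if self.memtable:
            frozen = dict(sorted(self.memtable.items(), key=lambda kv: kv[0]))
            self.sstables.append(frozen)
            self.memtable = {}

    def get(self, key: str) -> Optional[object]:
        # Newest data wins: check the memtable, then tables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None  # a missing key forced a look at every table

    def compact(self) -> None:
        # Merge all flushed tables, letting newer entries overwrite older ones,
        # then drop tombstones: no older table is left for them to cancel.
        merged: Dict[str, object] = {}
        for table in self.sstables:  # oldest first
            merged.update(table)
        live = {k: v for k, v in sorted(merged.items(), key=lambda kv: kv[0])
                if v is not TOMBSTONE}
        self.sstables = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed as the first table
db.delete("a")
db.put("c", 3)   # second flush; this table holds the tombstone for "a"
tables_before = len(db.sstables)
value_a = db.get("a")
db.compact()
tables_after = len(db.sstables)
```

The sketch also hints at why compaction strains resources: it has to read and rewrite every flushed table, even though the logical content shrinks.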