Fast data in times of crisis with GPU accelerated database QikkDB | Business Breakfast | 23.4.2020
Adastra Group
Our Solution Portfolio
Webinar: Fast data in times of crisis with the help of GPU
One Focus: Data & Digitalization
Advanced Analytics
(Big) Data Engineering
Data Governance
Cloud Services
Machine Learning & AI
Digital Transformation
Adastra introduction
International consulting company that creates functional solutions in various sectors, facilitating the transition to the digital era.
Cutting-edge software for data quality management, Master Data Management, and data governance.
Solutions to complex business problems in risk management, sales, and process optimization.
Specialist in mobile app development.
Full-service creative agency based on a strong technological background.
Recruitment for banks, financial institutions, telecoms and insurance companies, and many others, including Adastra.
Artificial intelligence, machine learning and optimization services.
Big data monetization solutions.
Technical & other details
The panel
Matej Misik – QikkDB & TellStory product owner
Tomas Synek – Moderator
Martin Zahumensky – TellStory power user
Ask questions & answer polls
Get beta access to the tools we show
Leave us feedback
Agenda for today
Databases & GPU intro with QikkDB [45 mins]
• Intro into the deep-tech DB space
• What GPUs are and how they accelerate HPC
Data storytelling with TellStory [45 mins]
• Traditional BI vs. data storytelling
• Explaining Covid-19 by creating a data story
Let’s GO
General intro to the problem and to DBs
Some of our challenges
Real-time visitor reporting over a stream of data
30k per second ~ 2.6 billion per day
e.g. monitoring a crowd during an event, targeted marketing
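The daily volume follows directly from the per-second rate. A quick check, with illustrative variable names that are not from the slides:

```python
# Sanity check of the ingest figures quoted on the slide.
EVENTS_PER_SECOND = 30_000        # rate quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
print(f"{events_per_day:,} per day")  # 2,592,000,000, i.e. about 2.6 billion
```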
Some of our challenges
Data science on large datasets
Testing hypotheses and ad-hoc querying when indexing is not predictable
Profiling new datasets
Large flows of commuters above 500 SIM cards
We were looking for solutions
Tested different technologies (Elastic, ClickHouse...) that did not work very well for us, for various reasons
Came across GPU-accelerated computing, so why not?
Elastic – slow on one node, slow data ingest
Actian Vector – faster, but still not performing well on one node
ClickHouse – much faster, no geo-spatial capabilities, Linux only
MS-SQL – not fast enough even when tuned
MapD (OmniSci) – considered, but far too expensive
Types of databases
By type of use:
• Transactional
• Batch
• Real-time
• Analytical
• Streaming
By using resources:
• In-memory
• Disk databases
• Hardware accelerated (FPGA, GPU, Quantum)
By stored data:
• Relational
• Columnar
• Time-series
• Graph
• Document
• Key-value
• ...
The technological edge – Why GPU?
GPUs for HPC (high-performance computing)
~10x higher performance in a single hardware unit
Great efficiency (cheaper computations)
Performance growing exponentially vs. linearly for CPUs
Image processing, tsunami simulation, DNA analyses
Generic commodity HW
Available in the cloud (AWS, Azure)
Lots of processors for parallel computing
Intel® Xeon® Platinum 8253 has 16 cores
NVIDIA Tesla V100 has 5120 cores and is data center focused
Rediscovery of Columnar Data Storage
Utilizing the computation power of GPUs requires a different approach to storing data. The database architecture that works best with parallel processing is columnar storage. In contrast to conventional relational databases, which store data in a row-based format, columnar databases store each column separately. In the context of parallel processing, GPUs love long vectors of the same data type.
Figure 1 (CPU vs. GPU): GPUs have thousands of arithmetic logic units (ALUs) in one piece of hardware.
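To make the row-based versus columnar contrast concrete, here is a minimal sketch in plain Python. The sample values and the names `rows` and `columns` are illustrative assumptions, and this is not how qikkDB lays out its data internally.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    (1, 5, "Apple"),
    (2, 4, "Grapes"),
    (3, 3, "Orange"),
]

# Column-oriented layout: one contiguous, single-typed vector per column.
columns = {
    "A": array("i", (r[0] for r in rows)),
    "B": array("i", (r[1] for r in rows)),
    "C": [r[2] for r in rows],
}

# A filter on A and B touches only those two vectors; column C is never read.
mask = [a >= b for a, b in zip(columns["A"], columns["B"])]
print(mask)  # [False, False, True]
```

Long vectors of one type are what a parallel device can split evenly across its cores, which is the point the slide makes.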
GPUs help to accelerate compute-intensive use cases
“1 GPU node replaces up to 54 CPU nodes” (NVIDIA)
New cards to be announced in 2020 with approx. 8000 cores & 40% faster
Inserting a GPU into the machine is not enough
Need to parallelize programs = hard
CUDA programming model by NVIDIA, available since 2007
Algorithms must be embarrassingly parallel
Multi-GPU
How the computation is spread onto cores

SELECT C FROM FRUIT_TABLE WHERE A >= B AND A < 5

Core  A  B  C       A >= B  A < 5  Final AND mask  Records meeting the condition  Result after reconstruction
1st   1  5  Apple   0       1      0               -                              Orange
1st   2  4  Grapes  0       1      0               -                              Lemon
1st   3  3  Orange  1       1      1               Orange                         -
2nd   4  2  Lemon   1       1      1               Lemon                          -
2nd   5  1  Banana  1       0      0               -                              -
nth   ...

(A >= B, A < 5 and the final AND mask are the logical conditions, evaluated in parallel on each GPU CUDA core.)
Transfer data from CPU RAM to GPU
GPU memory – no transfers from GPU to CPU RAM
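The mask-and-reconstruct idea from this slide can be emulated on a CPU. The sketch below is an assumption-laden illustration rather than qikkDB code: the chunks stand in for CUDA cores, `filter_chunk` is a hypothetical helper, and a thread pool only mimics the parallel step.

```python
from concurrent.futures import ThreadPoolExecutor

# Columns of FRUIT_TABLE as shown on the slide.
A = [1, 2, 3, 4, 5]
B = [5, 4, 3, 2, 1]
C = ["Apple", "Grapes", "Orange", "Lemon", "Banana"]

def filter_chunk(bounds):
    """Evaluate both predicates for one chunk and AND the masks element-wise."""
    start, stop = bounds
    mask_ge = [a >= b for a, b in zip(A[start:stop], B[start:stop])]
    mask_lt = [a < 5 for a in A[start:stop]]
    return [x and y for x, y in zip(mask_ge, mask_lt)]

# Two chunks stand in for the "1st" and "2nd" cores on the slide.
chunks = [(0, 3), (3, 5)]
with ThreadPoolExecutor() as pool:
    partial_masks = list(pool.map(filter_chunk, chunks))

# Reconstruction: concatenate the masks and keep only matching values of C.
final_mask = [bit for part in partial_masks for bit in part]
result = [c for c, keep in zip(C, final_mask) if keep]
print(result)  # ['Orange', 'Lemon']
```

Each chunk needs no information from any other chunk, which is what makes the filter embarrassingly parallel.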
Where is spatio-temporal data different?
Polygon Operations
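Polygon predicates differ from scalar filters because every row needs a geometric test rather than a single comparison. As an illustration, here is a "contains" check using the textbook ray-casting method. The slides do not say which algorithm qikkDB uses, so treat the function and its name as assumptions.

```python
def contains(polygon, point):
    """Return True if point lies inside polygon, a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(contains(square, (2, 2)))  # True
print(contains(square, (5, 2)))  # False
```

The test for one point is independent of every other point, so it parallelizes across rows in the same way as the scalar masks.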
Crucial requirements for the database system
Fast insert
Fast processing
Scalability & high availability
Limit pre-aggregations
Standardized access and common syntax
Deep-tech based on real science
Google Protocol Buffers
Processing data on GPU is written in CUDA 10 (direct commands to HW at single-core level)
Database core is written in the low-level language C++17 (memory management, control of instructions…)
Libraries for specific modules (networking, building, parsing…)
Created in cooperation with top talents from the Slovak Technical University
What is qikkDB for?
Filtering and aggregations over a single huge flat table
Spatio-temporal data processing
Complex polygon operations (contains, intersect, union)
Numeric and datetime data
Incremental data that grows over time
Network utilization & analysis, risk scoring, dynamic pricing, real-time analytics, hypothesis verification, profiling of big data, machine learning, etc.
Logs, Polygons, IoT, GPS, Network, Events, Automotive, Maps
So how fast is it?
1.2B data rows in 7 columns
Average execution time was obtained from 200 query runs
Biggest datasets tested at 400 GB, limited by memory; bigger datasets can be cached from disk, benchmarks to come soon
Execution Times Results
1. QikkDB
2. GiraffeDB – leading GPU database
3. CatDB – leading columnar database
4. RacoonDB – tuned leading relational database
GPU machine (p3.8xlarge): 4x Tesla V100
CPU machine (c5d.9xlarge): 36 CPU cores
We use codenames for well-known databases because, for legal reasons, we can’t tell you who these slow guys are.
Compared to Other DBs (results in ms)

Query  qikkDB    qikkDB       GiraffeDB  CatDB      RacoonDB   Elastic  Spark     Spark
       @ p3.8xl  @ g4dn.12xl  @ p3.8xl   @ c5d.9xl  @ c5d.9xl  (tuned)  21x m3xl  i3.8xl
#1     22        37           25         435        22         810      2362      22000
#2     37        82           235        1061       964        1818     3559      25000
#3     228       925          231        1630       3491       n/a      4019      27000
#4     283       1105         417        2174       3996       n/a      20412     65000
Avg    143       537          227        1325       2118       n/a      7588      34750

10 to 100x quicker
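The averages and the relative speed can be recomputed from the per-query timings on this slide. This is only a reader's check under simple assumptions: Elastic is left out because two of its queries are n/a, and the dictionary layout is illustrative.

```python
from statistics import mean

# Query times in ms for queries #1..#4, copied from the slide.
times_ms = {
    "qikkDB @ p3.8xl": [22, 37, 228, 283],
    "qikkDB @ g4dn.12xl": [37, 82, 925, 1105],
    "GiraffeDB @ p3.8xl": [25, 235, 231, 417],
    "CatDB @ c5d.9xl": [435, 1061, 1630, 2174],
    "RacoonDB @ c5d.9xl": [22, 964, 3491, 3996],
    "Spark 21x m3xl": [2362, 3559, 4019, 20412],
    "Spark i3.8xl": [22000, 25000, 27000, 65000],
}

averages = {name: mean(t) for name, t in times_ms.items()}
baseline = averages["qikkDB @ p3.8xl"]

# Print each average and how many times slower it is than the baseline.
for name, avg in averages.items():
    print(f"{name:20s} avg {avg:9.1f} ms  {avg / baseline:6.1f}x")
```

On these figures the CPU-based systems average roughly 9x (CatDB) to about 240x (Spark on i3.8xl) the qikkDB time on p3.8xlarge, while GiraffeDB on the same GPU machine is within a factor of two.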
The blazing speed
Same HW, 1.2bn data points, 2 databases
www.tellstory.cloud
Both running on AWS g4dn.12xlarge: 48 vCPU, 192 GB RAM, 4x Tesla T4 GPU
Deployed beta platform with data exploration front-end
QikkDB demo on smart meter data
What’s going on in the background?
Data storage & flow
Persisted data on disk (compressed)
Pre-loaded data in RAM
Relevant columns go to the GPU (PCI-E)
Data in GPU RAM (decompressed)
Filters & aggregations (CUDA kernels)
Result set
When inserting new data, a column is automatically created ~ “schema-less”, good for IoT and similar
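The "schema-less" behaviour, where an insert creates any column it has not seen before, can be sketched with a toy append-only column store. `ColumnTable` and the smart-meter field names are hypothetical and only illustrate the idea. This is not qikkDB's implementation.

```python
class ColumnTable:
    """Toy append-only column store; hypothetical, for illustration only."""

    def __init__(self):
        self.columns = {}   # column name -> list of values
        self.row_count = 0

    def insert(self, record):
        # Create any column seen for the first time, padded with None
        # for the rows that were inserted before it existed.
        for name in record:
            if name not in self.columns:
                self.columns[name] = [None] * self.row_count
        # Append one value to every column so all vectors stay the same length.
        for name, values in self.columns.items():
            values.append(record.get(name))
        self.row_count += 1

table = ColumnTable()
table.insert({"meter_id": 1, "kwh": 0.42})
table.insert({"meter_id": 2, "kwh": 0.37, "voltage": 229.8})
print(table.columns["voltage"])  # [None, 229.8]
```

Devices that start reporting a new measurement simply extend the table, which is why the slide calls this a good fit for IoT data.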
How can it scale?
Multi-GPU (vertical) scalability on a single node (up to 8 GPUs)
• Accelerating computations
• Enabling multiple sessions
Multi-level caching
• GPU RAM cache
• CPU RAM cache
On roadmap
• Multi-node (horizontal) scalability
• High availability
• Lazy data loading
Not limited by data size ~ best performance when data fits in GPU memory, but can load from disk on demand
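Multi-level caching can be pictured as a lookup that tries the fastest tier first and falls back to slower ones. The class below is a hypothetical sketch: plain dictionaries stand in for GPU and CPU memory, there is no eviction, and `load_from_disk` is an assumed callback.

```python
class TieredColumnCache:
    """Hypothetical lookup order: GPU RAM cache, then CPU RAM cache, then disk."""

    def __init__(self, load_from_disk):
        self.gpu_cache = {}
        self.cpu_cache = {}
        self.load_from_disk = load_from_disk  # callable: column name -> data

    def get(self, column):
        if column in self.gpu_cache:
            return self.gpu_cache[column], "gpu"
        if column in self.cpu_cache:
            data, tier = self.cpu_cache[column], "cpu"
        else:
            data, tier = self.load_from_disk(column), "disk"
            self.cpu_cache[column] = data
        # Promote to the fastest tier so the next query skips the transfer.
        self.gpu_cache[column] = data
        return data, tier

cache = TieredColumnCache(load_from_disk=lambda name: [1, 2, 3])
print(cache.get("kwh")[1])  # disk
print(cache.get("kwh")[1])  # gpu
```

A real system would also have to evict columns when GPU memory fills up, which this sketch deliberately leaves out.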
Why not just index?
Traditional databases use indexing for faster processing, which results in slow inserts
qikkDB does not need indexing (but indexes are available anyway)
Data are just appended
GPU takes care of fast processing
Integration with your environment on standards you know
Kafka connector – streaming data
ODBC/JDBC – visualization tools (PowerBI…)
Adapters (C#, Java, Python) – custom applications, data analysis…
Speed up your BI tools, applications or use TellStory for fast analysis
TellStory – exploration & analysis FE
Data storytelling with real-time data exploration
Expensive hardware?
GPU on AWS: 12 USD/hour
GPU HW: ~50k USD
QikkDB can handle queries in a fraction of the time of traditional databases, so you can do more with your hardware allocation in the same time.
It also means that to do the same amount of work you need a lot less hardware, and therefore you save on costs.
“1 GPU node replaces up to 54 CPU nodes” (NVIDIA)
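The two price points on this slide imply a break-even between renting and buying. The calculation below is a rough sketch that uses only the slide's figures and ignores power, hosting, staff and any cloud discounts.

```python
CLOUD_USD_PER_HOUR = 12    # GPU instance on AWS, from the slide
HARDWARE_USD = 50_000      # approximate purchase price, from the slide

# Hours of cloud rental that cost as much as buying the hardware outright.
break_even_hours = HARDWARE_USD / CLOUD_USD_PER_HOUR
break_even_days = break_even_hours / 24
print(f"{break_even_hours:.0f} hours, about {break_even_days:.0f} days of continuous use")
```

With these inputs the rental cost matches the purchase price after roughly 4,167 hours, or about 174 days of round-the-clock use.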
In short: Interactive analytics on massive data sets
GPU acceleration
• Billions of data points in milliseconds
Great for spatio-temporal data
• Finding & understanding links between data points in space & time
Standard SQL syntax
• Easy to start using & integrate into the data science environment
Efficiency & speed
• GPUs are becoming commodity HW and, thanks to their efficiency, cost per 10k queries is on par with CPU approaches
GPU columnar DB
Real-time queries in milliseconds
API, ODBC, JDBC, connects to everything
SQL standard
Spatio-temporal data processing
Cloud or on-prem
Part 2!
Databases & GPU intro with QikkDB
• Intro into the deep-tech DB space
• What GPUs are and how they accelerate HPC
Data storytelling with TellStory
• Traditional BI vs. data storytelling
• Explaining Covid-19 by creating a data story
• Live stories and fast data
• TellStory Roadmap
Q&A
Let’s GO
Martin is the former CEO of Instarea, now working at Ataccama as Head of Product Strategy.
Martin created the https://qikk.ly/covid19 story and will lead you through how he did it.
Storytelling
Interpreted data, easy to understand, with new facts brought to the reader
And once they have the story they can start to sell it to other parties
Animated video playing
A story is about being visual
1. Cool visualization, plugins & animations
2. Newspaper-like reading & interactive
3. Interesting facts
Creating the Covid-19 story live
LIVE story
When you want to have the story live, you must have the data live, and when you work with data sets of billions of records you need a fast database
Animated video playing
TellStory Roadmap
Beta release JUNE
• Find interesting facts
• Minute-by-minute updates (be notified when something interesting happens)
• Animated visualizations (timeline charts, maps)
• Share as video (Instagram upload, YouTube livestream)
• Google Sheets integration
• Auto-update data (scheduled refresh)
• Embed sections (embedding only parts of a story will be possible)
More features to come in Phase 2; “Let AI create your Story” is in progress
Value proposition for Adastra services with these tools
Quick pilots for hands-on experience
• GPU data acceleration: 2-month pilot to deliver real-time processing of vast streaming data (e.g. 5G, smart meters, transactions)
• Data storytelling: 1-month pilot to provide customers with live & interactive intelligence and insights
• Data storytelling: 1-month pilot to give management the minute-by-minute data they need
Q&A
Check out www.qikk.ly and www.tellstory.ai
Useful links
More info
• https://qikk.ly – product web with basic information
• https://qikk.ly/downloads/qikkDB_white_paper.pdf – White paper
• https://docs.qikk.ly/ – Documentation & installation instructions
• https://support.qikk.ly/ – Issues & features reporting portal
• https://tellstory.cloud – Front-end for data visualization, SQL console on AWS
• https://tellstory.ai – Find out more about TellStory
