Big Data Computations - Comparing Apache HAWQ, Druid, and GPU Databases Presentation

Uploaded by

Sergio Bruno

Distributed Computations for Analytics:

Benchmarking Druid, HAWQ and Kinetica


MAY 2017

DR. VIJAY SRINIVAS AGNEESWARAN,
DIRECTOR AND HEAD OF DATA TECHNOLOGIES,
SAPIENTRAZORFISH
Distributed Merge Tree
[Diagram: a three-level merge tree]
Category (root): Pel_Count, merged list of Pel_IDs
Sub-Category (three nodes): Pel_Count, merged list of Pel_IDs
Domain (five leaves): Pel_Count, list of Pel_IDs
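The bottom-up merge in the tree above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the Node class, the sample names and the sample Pel_IDs are hypothetical, and it assumes that Pel_Count means the number of distinct Pel_IDs, so merging is a set union.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the Category > Sub-Category > Domain tree (hypothetical class)."""
    name: str
    pel_ids: set = field(default_factory=set)    # filled in directly only for Domain leaves
    children: list = field(default_factory=list)

    def merge(self) -> set:
        """Merge the children's Pel_ID sets into this node and return the result."""
        merged = set(self.pel_ids)
        for child in self.children:
            merged |= child.merge()               # assumption: duplicates count once
        self.pel_ids = merged
        return merged

    @property
    def pel_count(self) -> int:
        return len(self.pel_ids)


# Hypothetical sample data: Pel_IDs 3 and 4 appear in more than one domain.
news = Node("news.example", {1, 2, 3})
sport = Node("sport.example", {3, 4})
video = Node("video.example", {4, 5, 6})
sub_a = Node("Sub-Category A", children=[news, sport])
sub_b = Node("Sub-Category B", children=[video])
category = Node("Category", children=[sub_a, sub_b])
category.merge()
```

With a set union, a Pel_ID seen in several domains is counted once at each higher level, which is why the counts cannot simply be summed up the tree.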

COPYRIGHT SAPIENTRAZORFISH | CONFIDENTIAL 2


Apache Druid + S3 Deep Storage

SWEET SPOT:

• Real-time ad-hoc/analytical queries
• Viewed as an analytical database for operations/network monitoring, video, and online advertising

EXTERNAL DEPENDENCIES:

• Deep storage: S3
• Metadata store: Derby

[Diagram labels: Broker; ZooKeeper and Coordinator; Historical 1; Historical 2; S3 Deep Storage]


Apache HAWQ

Source of above diagram:
https://hortonworks.com/wp-content/uploads/2015/02/Introducing-the-Newly-Redesigned-Apache-HAWQ-Incubating-1.pdf



Apache HAWQ: Custom Architecture

HAWQ uses a massively parallel processing (MPP) architecture and runs its own services, called Segments, to access data in HDFS directly. It is also the only other SQL-on-Hadoop solution to provide ANSI SQL compliance.

[Diagram labels: Broker; HAWQ/Application Master; HAWQ Master (x2); HDFS Data Node (x2); HAWQ Segment Host (x5); HDFS Data Node (x5)]



Kinetica Architecture

• Built-in connectors simplify integration with the most common open source frameworks: HBase, Kibana, Kafka, MapReduce, Spark and Storm.
• Real-time and batch data from various sources can be ingested into Kinetica in parallel using these connectors.
• User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
• UDFs have direct access to CUDA APIs, which enables compute-to-grid analytics for logic deployed within Kinetica.
• ODBC/JDBC drivers provide seamless integration with existing visualization and business intelligence tools like Tableau and Caravel.
• Kinetica Reveal also provides a SQL interface for ad hoc querying.



Partitioning Data Across a Cluster of Nodes

https://docs.microsoft.com/en-us/aspnet/aspnet/overview/developing-apps-with-windows-azure/building-real-world-cloud-apps-with-windows-azure/data-partitioning-strategies



Multi-dimensional Partitioning Strategy

         JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
1-30      0    1    2    3    4    5    6    7    0    1    2    3
31-60
61-90
91-120
……

Simple 2-d grid partitioning.
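A minimal sketch of the grid above, mapping one cell to a node id. Only the first row of the grid (node ids 0 to 7 repeating across the months) comes from the slide. The band width of 30, the function names, and the rule that each lower row is shifted by one node are assumptions made for illustration.

```python
from typing import Tuple

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
BAND_WIDTH = 30   # row bands 1-30, 31-60, 61-90, ... as on the slide
NUM_NODES = 8     # the first row on the slide uses node ids 0..7


def grid_cell(month: str, key: int) -> Tuple[int, int]:
    """Return (row, column) of the grid cell that holds this month and key."""
    row = (key - 1) // BAND_WIDTH
    col = MONTHS.index(month)
    return row, col


def node_for(month: str, key: int, num_nodes: int = NUM_NODES) -> int:
    """Assign a cell to a node round-robin along the month axis.

    Assumption: each row is offset by one node relative to the row above it;
    the slide only shows the assignment for the first row.
    """
    row, col = grid_cell(month, key)
    return (row + col) % num_nodes


first_row = [node_for(month, 1) for month in MONTHS]
```

Here first_row reproduces the sequence shown for the 1-30 band, so a query over a few consecutive months touches several different nodes rather than one.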



Hilbert Curve Allocation Method (HCAM)
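The slide gives only the name of the method. As commonly described, the idea is to walk the cells of the partition grid in Hilbert-curve order and hand them out to nodes round-robin, so that cells close together in the grid tend to land on different nodes. The sketch below assumes a square grid whose side is a power of two; the function names and the node count are hypothetical.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along the Hilbert curve over an n x n grid.

    n must be a power of two. This is the standard iterative conversion
    from grid coordinates to curve position.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcam_node(n: int, x: int, y: int, num_nodes: int) -> int:
    """Round-robin node assignment along the curve (num_nodes is an assumption)."""
    return hilbert_index(n, x, y) % num_nodes


# Node id for every cell of a 4 x 4 grid spread over 8 nodes.
allocation = {(x, y): hcam_node(4, x, y, 8) for x in range(4) for y in range(4)}
```

Because consecutive positions on the curve are always neighbouring cells, a range query that covers a compact block of cells is spread over many nodes.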



General Multi-dimensional Partitioning (GeMDA)

Yu-Lung Lo, Kien A. Hua, and Honesty C. Young. 2001. GeMDA: A Multidimensional Data Partitioning Technique for Multiprocessor Database Systems. Distrib. Parallel Databases 9, 3 (May 2001), 211-236. DOI: http://dx.doi.org/10.1023/A:1019265612794
Multi-dimensional Partitioning: Cluster Architecture

APPLICATION SERVER SIDE

[Diagram components: Client UI; Load Balancer; App Servers (REST); Master (REST, metadata); Slaves, each with an in-memory database and metadata, holding Partition (1 to 35), Partition (36 to 70), Partition (71 to 100), Partition (xx to xx); AWS S3]

Notes:

• Each app server sends requests to the master with all the parameters.
• The master holds the full metadata (Category > Sub-Category > Domain tree).
• The master analyses the input and creates a separate request for each slave, depending on the partitions involved.
• The master sends the separate requests to the slaves, collates all the responses, and returns the output.
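The request flow in the notes can be sketched as a small scatter-gather loop. This is an illustration under stated assumptions, not the system that was benchmarked: the Slave and Master classes, the row layout and the sample data are hypothetical, the metadata tree is flattened to a category-to-domains mapping, and the collated output is assumed to be a distinct Pel_ID count. The partition ranges (1 to 35, 36 to 70, 71 to 100) are taken from the diagram.

```python
from concurrent.futures import ThreadPoolExecutor


class Slave:
    """Holds the rows of one partition range in memory (hypothetical class)."""

    def __init__(self, lo, hi, rows):
        self.lo, self.hi = lo, hi
        self.rows = [r for r in rows if lo <= r["partition"] <= hi]

    def query(self, partitions, domain_ids):
        """Return the Pel_IDs seen in the requested partitions and domains."""
        return {r["pel_id"] for r in self.rows
                if r["partition"] in partitions and r["domain_id"] in domain_ids}


class Master:
    """Splits a request per slave, fans it out, and collates the answers."""

    def __init__(self, slaves, metadata):
        self.slaves = slaves
        self.metadata = metadata          # assumption: category -> set of domain ids

    def handle(self, category, partitions):
        domain_ids = self.metadata[category]
        requests = []
        for slave in self.slaves:
            owned = [p for p in partitions if slave.lo <= p <= slave.hi]
            if owned:                     # skip slaves that own none of the partitions
                requests.append((slave, owned))
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda req: req[0].query(req[1], domain_ids), requests))
        return len(set().union(*partials))


# Hypothetical sample rows.
rows = [
    {"partition": 10, "domain_id": "d1", "pel_id": 1},
    {"partition": 40, "domain_id": "d1", "pel_id": 1},
    {"partition": 40, "domain_id": "d2", "pel_id": 2},
    {"partition": 80, "domain_id": "d3", "pel_id": 3},
    {"partition": 90, "domain_id": "d1", "pel_id": 4},
]
slaves = [Slave(1, 35, rows), Slave(36, 70, rows), Slave(71, 100, rows)]
master = Master(slaves, {"cat": {"d1", "d2"}})
reach_first_two = master.handle("cat", range(1, 71))
reach_all = master.handle("cat", range(1, 101))
```

A request that only covers partitions 1 to 70 never reaches the third slave, which is the point of having the master inspect the partitions before fanning out.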



Data Model: Flat table

Columns: Domain ID, Person_ID (represents Pel_ID), Session_ID, Session_Duration, Date

Multi-dimensional partitioning: date, domain

Date
Range
1-30 0 1 2 3 4 5 6 7 0 1 2 3
31-60
61-90
91-120
……



Apache HAWQ

MULTI-DIMENSIONAL PARTITIONING
• Partition by
• Sub-partition by

OPTIMIZED DISTRIBUTED QUERIES
• Partition Selector
• Dynamic Scan

Lyublena Antova, Amr El-Helw, Mohamed A. Soliman, Zhongxian Gu, Michalis Petropoulos, and Florian Waas. 2014. Optimizing queries over partitioned tables in MPP systems. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 373-384. DOI: https://doi.org/10.1145/2588555.2595640
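The two operator names on this slide suggest a split of work that can be illustrated in a few lines: one step decides which partitions can contain matching rows, and a second step scans only those. The sketch below is a conceptual Python analogy, not HAWQ code or its planner API. It assumes a table partitioned by month and sub-partitioned by a domain-id band of width 30; all function names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

DOMAIN_BAND = 30   # hypothetical sub-partition width


def part_key(row):
    """(partition, sub-partition) key: month of the date, band of the domain id."""
    d = row["date"]
    return (d.year, d.month), (row["domain_id"] - 1) // DOMAIN_BAND


def build_table(rows):
    table = defaultdict(list)
    for row in rows:
        table[part_key(row)].append(row)
    return table


def partition_selector(table, start, end, domain_ids):
    """Pick the partition keys that can contain rows matching the predicate."""
    bands = {(d - 1) // DOMAIN_BAND for d in domain_ids}
    lo, hi = (start.year, start.month), (end.year, end.month)
    return [key for key in table if lo <= key[0] <= hi and key[1] in bands]


def dynamic_scan(table, selected, start, end, domain_ids):
    """Scan only the selected partitions and apply the full predicate."""
    for key in selected:
        for row in table[key]:
            if start <= row["date"] <= end and row["domain_id"] in domain_ids:
                yield row


# Hypothetical sample rows.
rows = [
    {"date": date(2015, 11, 3), "domain_id": 5, "pel_id": 1},
    {"date": date(2015, 11, 20), "domain_id": 40, "pel_id": 2},
    {"date": date(2015, 12, 1), "domain_id": 5, "pel_id": 3},
    {"date": date(2015, 11, 28), "domain_id": 7, "pel_id": 4},
]
table = build_table(rows)
start, end = date(2015, 11, 1), date(2015, 11, 30)
selected = partition_selector(table, start, end, {5})
matches = [r["pel_id"] for r in dynamic_scan(table, selected, start, end, {5})]
```

In this sample, three partitions exist but only one is selected, so the December rows and the rows in the second domain band are never read.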
Performance Studies
1 Month Range Queries for Specific Action IDs.

Cluster config:
• Druid: 4 EC2 R3.4X machines (2 storage nodes, 1 broker node and 1 ZooKeeper)
• HAWQ: 3 EC2 R3.4X machines (2 clusters and 1 application master)
• Kinetica: 1 EC2 P2.8X large (8 GPUs, 32 CPUs, 488 GB RAM)

Table content (same for all systems): TV Ranker – program ids and pel_ids; Filter criteria – action_ids and pel_ids

Table record count (same for all systems):
• Nov 15: TV Ranker – 229 million; Filter criteria – 240 million
• Dec 15: TV Ranker – 122 million; Filter criteria – 259 million

System | Timeframe | Data load time/method | Reachability metric computation time (average of 3 runs)
Druid | Nov 15 | 30 mins (Druid Ingest Proc) and 12 minutes | 1.135 minutes
HAWQ | Nov 15 | 15 seconds (gpfdist) | 20.33 seconds
Kinetica | Nov 15 | 5 mins (custom Spark connector with 30 threads/splits) | 5.74 seconds (3.59 secs on 2 P2.8X node cluster)
Druid | Dec 15 | 14 minutes and 16 minutes | 1.153 minutes
HAWQ | Dec 15 | 8 seconds and 12 seconds | 12.7 seconds
Kinetica | Dec 15 | 5 minutes | 5.21 seconds (3.65 secs on 2 P2.8X node cluster)
Action IDs: 8137, 7752, 8045, 8031, 8003, 7987 for Druid, HAWQ and Kinetica
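To compare the three systems on a common scale, the Nov 15 computation times reported above can be converted to seconds and expressed relative to the fastest single-node figure. The numbers in this short sketch are copied from the table; the helper name is hypothetical.

```python
# Reach computation times for the Nov 15 one-month query, in seconds.
NOV15_SECONDS = {
    "Druid": 1.135 * 60,   # reported as 1.135 minutes
    "HAWQ": 20.33,
    "Kinetica": 5.74,      # single P2.8X node
}


def relative_to(baseline, timings):
    """Express every timing as a multiple of the baseline system's timing."""
    base = timings[baseline]
    return {name: secs / base for name, secs in timings.items()}


ratios = relative_to("Kinetica", NOV15_SECONDS)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x the Kinetica time")
```

The ratios only compare query time; the cluster configurations differ between systems, so they are not a like-for-like hardware comparison.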
Performance Studies

1 Month Range Queries for Specific Action IDs.

Cluster config:
• Druid: 4 EC2 R3.4X machines (2 storage nodes, 1 broker node and 1 ZooKeeper)
• HAWQ: 3 EC2 R3.4X machines (2 clusters and 1 application master)
• Kinetica: 1 EC2 P2.8X large (8 GPUs, 32 CPUs, 488 GB RAM)

Table content (same for all systems): TV Ranker – program ids and pel_ids; Filter criteria – action_ids and pel_ids

Table record count (same for all systems):
• 1 day: TV Ranker – 229 million; Filter criteria – 240 million
• 1 week: TV Ranker – 122 million; Filter criteria – 259 million

System | Timeframe | Data load time/method | Reachability metric computation time (average of 3 runs)
Druid | 1 day | 30 mins (Druid Ingest Proc) and 12 minutes | 2.76 seconds
HAWQ | 1 day | 15 seconds (gpfdist) | 4.93 seconds
Kinetica | 1 day | 5 mins (custom Spark connector with 30 threads/splits) | 2.64 seconds
Druid | 1 week | 14 minutes and 16 minutes | 20.66 seconds
HAWQ | 1 week | 8 seconds and 12 seconds | 10.33 seconds
Kinetica | 1 week | 5 minutes | 5.21 seconds

Action IDs: 8137, 7752, 8045, 8031, 8003, 7987 for Druid, HAWQ and Kinetica
Performance Studies

Query for All Action IDs.

Table content, record count and time frame:
• TV Ranker: program ids and pels, 122 million, Dec '15
• Filter Criteria: action ids and pels, 259 million, Dec '15

Cluster details:
• HAWQ: 3 EC2 R3.4X machines (2 clusters and 1 application master)
• Druid: 4 EC2 R3.4X machines (2 storage nodes, 1 broker node and 1 ZooKeeper)

Action IDs: All (22,000)

Execution time:
• HAWQ: 83 seconds
• Druid: Indeterminate
• Kinetica: 6.45 seconds (2.45 secs on 2 node cluster)



Credits

Sapient
• Raghunadh Nittala
• Arunkumar Ramnatha
• Sunil Madhusoodhan Nair
• Vikram Hedge
• Sreeman Narayana
• Ritesh Soni
• IDIOM team – ZZ and others
• Prashant Mehta

Kinetica
• James Mesney
• Charles Sutton
• Matt Hawkins



THANK YOU!
VAGNEESWARAN@SAPIENT.COM
