This is not a contribution
Evan Chan June 2016
700 Updatable Queries Per Second:
Spark as a Real-Time Web Service
Who Am I?
User and contributor to Spark since 0.9, Cassandra
since 0.6
DataStax Cassandra MVP
Created Spark Job Server and FiloDB
Talks at Spark Summit, Cassandra Summit, Strata, Scala
Days, etc.
Apache Spark
• Usually used for rich analytics, not time-critical work.
Machine learning: generating models, predictions, etc.
SQL queries: seconds to minutes, low concurrency
Stream processing
• What about low-latency, highly concurrent queries? Dashboards?
Low-Latency Web Queries
• Why is it important?
Dashboards
Interactive analytics
Real-time data processing
• Why not use the Spark stack for this?
Web Query Stack
Web Client / JS
App Server
RDBMS
Spark-based Low-Latency Stack
Web Client / JS
???
Spark
Creating a new SparkContext is S-L-O-W
Start up HTTP/BitTorrent File Server
Start up UI
Start up executor processes and wait for
confirmation
• The bigger the cluster, the slower!
Using a Persistent Context for Low Latency
Avoid high overhead of Spark application launch
Standard pattern:
Spark Job Server
Hive Thrift Server
Accept queries and run them in context
Usually means fixed resources - great for SLA
predictability
FAIR Scheduling
FIFO vs FAIR Scheduling
FAIR scheduler can co-schedule concurrent Spark
jobs even if they take up lots of resources
Scheduler pools with individual policies
Higher concurrency
FIFO allows concurrency if tasks do not use up all
threads
In Mesos, use coarse-grained mode to avoid
launching executors on every Spark task
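As a rough sketch of the settings this slide refers to, assuming the Spark 1.x `SparkConf` API and a master supplied by `spark-submit`. The application name and the pool name `dashboards` are hypothetical, and pool policies would be defined separately in the fair scheduler allocation file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: one long-lived context that schedules concurrent jobs fairly.
val conf = new SparkConf()
  .setAppName("low-latency-queries")     // hypothetical name
  .set("spark.scheduler.mode", "FAIR")   // the default mode is FIFO
  .set("spark.mesos.coarse", "true")     // only relevant when running on Mesos

val sc = new SparkContext(conf)

// Jobs submitted from the current thread are placed in the named pool.
// Pools are configured via the file named by spark.scheduler.allocation.file.
sc.setLocalProperty("spark.scheduler.pool", "dashboards")
```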
Low-Latency Game Plan
Start a persistent SparkContext (e.g. the Hive
Thrift Server - we’ll get to that below)
Run it in FAIR scheduler mode
Use fast in-memory storage
Maximize concurrency by using as few partitions/threads
per query as possible
Host the data and run it on a single node - avoid
expensive network shuffles
In-Memory Storage
Is it really faster than on-disk files with OS caching?
It's about consistency of performance - not just hot
data in the page cache, but ALL data.
Fast random access
Making different tradeoffs as new memory
technologies emerge (NVRAM etc.)
Higher IO -> less need for compression
Apache Arrow
So, let’s talk about Spark storage in detail…
HDFS? Parquet Files?
Column pruning speeds up I/O significantly
Still have to scan lots of files
File organization not the easiest for filtering
For low-latency, need much more fine-grained
indexing
Cached RDDs
• Let's say you have an RDD[T], where each item is of type T.
Bytes are saved on the JVM heap, or optionally heap + disk
Spark optionally serializes it, using Java serialization by
default, so it (hopefully) takes up less space
Pros: easy (myRdd.cache())
Cons: have to iterate over every item, no column
pruning, slow if deserialization is needed, memory hungry,
cannot update
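The caching options mentioned on this slide map onto Spark storage levels. A minimal sketch, assuming the standard `RDD.persist` API; the helper names are made up for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): deserialized objects on the heap.
def cacheOnHeap[T](rdd: RDD[T]): RDD[T] = rdd.cache()

// Serialized storage trades CPU (deserialization on every read) for a smaller footprint.
def cacheSerialized[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Heap + disk: partitions that do not fit in memory are spilled to disk.
def cacheWithSpill[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```

Whichever level is chosen, a query still iterates over every cached item, which is the main drawback listed above.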
Cached DataFrames
• Works on a DataFrame (RDD[Row] with a schema)
• sqlContext.cacheTable(tableA)
Uses columnar storage for a compact in-memory representation
Column pruning for faster querying
Pros: easy, efficient memory footprint, fast!
Cons: no filtering, cannot update
Why are Updates Important?
Appends
Streaming workloads. Add new data continuously.
Real data is *always* changing. Queries on live
real-time data have business benefits.
Updates
Idempotency = really simple ingestion pipelines
Simpler streaming later
update late events (See Spark 2.0 Structured
Streaming)
Advantages of Filtering
Two methods to lower query latency:
Scan data faster (in-memory)
Scan less data (filtering)
RDDs and cached DFs - prune by partition
Dynamo/BigTable - 2D Filtering
Filter by partition
Filter within partitions
Workarounds - Updating RDDs
Union(oldRDD, newRDD)
Creates a tree of RDDs - slows down queries
significantly
IndexedRDD
Introducing FiloDB.
A distributed, versioned, columnar analytics database.
Built for streaming.
Fast Analytics Storage
Scan speeds competitive with Apache Parquet
In-memory version significantly faster
Flexible filtering along two dimensions
Much more efficient and flexible partition key
filtering
Efficient columnar storage using dictionary encoding
and other techniques
Updatable
Comparing Storage Costs and Query Speeds
• https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis
Robust Distributed Storage
• In-memory storage engine, or
• Apache Cassandra as the rock-solid storage engine.
Cassandra-like Data Model
partition keys - distribute data around a cluster, and
allow for fine-grained and flexible filtering
segment keys - do range scans within a partition, e.g. by
time slice
primary-key-based ingestion and updates
                 Column A               Column B
Partition Key 1  Segment 1 | Segment 2  Segment 1 | Segment 2
Partition Key 2  Segment 1 | Segment 2  Segment 1 | Segment 2
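To make the two-level layout concrete, here is a toy in-memory model of it: a map from partition key to a sorted map of segments. This only sketches the idea of filtering by partition and then range-scanning segments. It is not FiloDB code, and every name in it is hypothetical.

```scala
import scala.collection.immutable.SortedMap

// Toy model of the layout: partition key -> (segment key -> rows).
object TwoLevelStore {
  type Row = (String, Double)
  type Store = Map[String, SortedMap[Int, Vector[Row]]]

  // Filter by partition first, then range-scan the segments inside each match.
  // Segment bounds are inclusive; partitions are visited in sorted order.
  def scan(store: Store, partitions: Set[String], fromSeg: Int, toSeg: Int): Vector[Row] =
    partitions.toVector.sorted.flatMap { pk =>
      store.get(pk).toVector.flatMap { segments =>
        segments.range(fromSeg, toSeg + 1).values.flatten
      }
    }
}
```

Partitions that are not requested are never touched, and within a partition only the segments in the requested range are read.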
Very Flexible Filtering
• Unlike Cassandra, FiloDB offers very flexible and
efficient filtering on partition keys. Partial key matches,
fast IN queries on any part of the partition key.
• No need to write multiple tables to work around
answering different queries.
Spark SQL Queries!
• Read from and write to Spark DataFrames
• Append/merge to FiloDB tables from Spark Streaming
• Use Tableau or any other JDBC tool
CREATE TABLE gdelt USING filodb.spark OPTIONS (dataset "gdelt");
SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone
DESC LIMIT 15;
INSERT INTO gdelt SELECT * FROM NewMonthData;
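The same query can presumably be expressed through the DataFrame reader. The sketch below assumes the FiloDB Spark connector is on the classpath, that the `gdelt` dataset already exists, and that the data source accepts the same `dataset` option as the `CREATE TABLE` statement above. The function name `topTone` is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Sketch: DataFrame equivalent of the SELECT above (Spark 1.4+ reader API).
def topTone(sqlContext: SQLContext): DataFrame = {
  val gdelt = sqlContext.read
    .format("filodb.spark")      // same source name as CREATE TABLE ... USING
    .option("dataset", "gdelt")  // same option as OPTIONS (dataset "gdelt")
    .load()

  gdelt.select("Actor1Name", "Actor2Name", "AvgTone")
    .orderBy(gdelt("AvgTone").desc)
    .limit(15)
}
```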
What’s in the Name?
• Rich, sweet layers of distributed, versioned database
goodness
[Architecture diagram. Labels: Message Queue; Events; Spark Streaming; Short term storage, K-V; Adhoc, SQL, ML; Cassandra; FiloDB: Events, ad-hoc, batch; Spark; Dashboards, maps]
SMACK stack for all your analytics
Regular Cassandra tables for highly concurrent,
aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event
storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted / scored data
[Architecture diagram. Labels: Message Queue; Events; Spark Streaming; Models; Cassandra; FiloDB: Long term event storage; Spark; Learned Data]
Fast SQL Server in Spark
Data: The New York City Taxi Dataset
• The public NYC Taxi Dataset contains telemetry (pickup and dropoff locations, times)
for millions of taxi rides in NYC.
Partition key - :stringPrefix medallion 2 - hash multiple drivers' trips into
~300 partitions
Segment key - :timeslice pickup_datetime 6d
Row key - hack_license, pickup_datetime
• Allows for easy filtering by individual drivers, and slicing by time.
Medallion Prefix   1/1 - 1/6   1/7 - 1/12
AA                 records     records
AB                 records     records
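One way to read the key definitions above is as two small functions: a two-character prefix of the medallion, and a 6-day time bucket. The helpers below are hypothetical illustrations, not FiloDB's implementation. In particular, aligning buckets to the Unix epoch (rather than to 1/1 as in the table) is an assumption.

```scala
import java.time.{Duration, LocalDateTime, ZoneOffset}

// Hypothetical helpers mirroring the partition and segment key definitions.
object TaxiKeys {
  // ":stringPrefix medallion 2": the first two characters of the medallion.
  def partitionKey(medallion: String): String = medallion.take(2)

  private val sliceMillis: Long = Duration.ofDays(6).toMillis

  // ":timeslice pickup_datetime 6d": start (epoch millis, UTC) of the
  // 6-day bucket that contains the pickup time.
  def segmentKey(pickup: LocalDateTime): Long = {
    val millis = pickup.toInstant(ZoneOffset.UTC).toEpochMilli
    java.lang.Math.floorDiv(millis, sliceMillis) * sliceMillis
  }
}
```

Trips whose medallions share a prefix land in the same partition, and trips within the same 6-day window share a segment, so a query for one driver over a short time range touches very little data.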
collectAsync
• To support running concurrent queries better, we rely on a
relatively unknown feature of Spark's RDD API, collectAsync:
• sqlContext.sql(queryString).rdd.collectAsync
• This returns a Scala Future, which can easily be composed
using Future.sequence to launch a whole series of
asynchronous RDD operations. They will be executed with
the help of a separate ForkJoin thread pool.
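A minimal sketch of that composition, assuming a Spark 1.x `SQLContext` and an `ExecutionContext` supplied by the caller. The function name `runAll` is made up for illustration:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import scala.concurrent.{ExecutionContext, Future}

// Launch every query without blocking, then gather the results into one Future.
def runAll(sqlContext: SQLContext, queries: Seq[String])
          (implicit ec: ExecutionContext): Future[Seq[Seq[Row]]] = {
  // collectAsync returns a FutureAction, which is a scala.concurrent.Future.
  val launched: Seq[Future[Seq[Row]]] =
    queries.map(q => sqlContext.sql(q).rdd.collectAsync())
  Future.sequence(launched)
}
```

All queries are submitted before any result is awaited, so the scheduler can run them side by side.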
Initial Results
Run lots of queries concurrently using collectAsync
Spark local[*] mode
SQL queries on first million rows of NYC Taxi dataset
50 Queries per Second
Most of the time is spent parsing SQL, not running queries!
Some Observations
1. Starting up a Spark task is actually pretty low
latency - milliseconds
2. One huge benefit to filtering is reduced thread/CPU
usage. Most of the queries ended up being single
partition / single thread.
Lessons
1. Cache the SQL to DataFrame/LogicalPlan parsing.
This saves ~20ms per parse, which is not
insignificant for low-latency apps
2. Distribute the SQL parsing away from the main
thread so it's not gated by one thread
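Lesson 2 can be sketched with plain Scala futures. Here `parse` stands in for the expensive SQL-to-plan step, and the pool size of 4 is arbitrary. Both are assumptions for illustration, not values from the talk.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Parse queries on a dedicated pool instead of a single main thread.
object ParallelParsing {
  private val pool = Executors.newFixedThreadPool(4)
  implicit val parseEc: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  // Each query is parsed on whichever pool thread is free;
  // results come back in the same order as the input.
  def parseAll[P](queries: Seq[String])(parse: String => P): Future[Seq[P]] =
    Future.sequence(queries.map(q => Future(parse(q))))

  // The pool uses non-daemon threads, so shut it down when finished.
  def shutdown(): Unit = pool.shutdown()
}
```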
SQL Plan Caching
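• Cache the `DataFrame` containing the logical plan translated from
parsing SQL.
• Now - **700 QPS**!!
import org.apache.spark.sql.DataFrame
// Maps SQL text to its parsed DataFrame; synchronized because mutable.HashMap is not thread-safe.
val cachedDF = new collection.mutable.HashMap[String, DataFrame]
def getCachedDF(query: String): DataFrame = cachedDF.synchronized {
  cachedDF.getOrElseUpdate(query, sqlContext.sql(query))
}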
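Because the cache is hit from many concurrent queries, a variant built on `ConcurrentHashMap` may be worth considering. The sketch below is generic: `parse` is a placeholder for the SQL parsing call, and `PlanCache` and `parseCount` are hypothetical names added only to show that each distinct query is parsed once.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Thread-safe cache keyed by SQL text.
class PlanCache[P](parse: String => P) {
  private val plans = new ConcurrentHashMap[String, P]()
  val parseCount = new AtomicInteger(0) // counts real parses, for illustration

  private val loader = new JFunction[String, P] {
    override def apply(query: String): P = {
      parseCount.incrementAndGet()
      parse(query)
    }
  }

  // computeIfAbsent runs the loader at most once per key, even under contention.
  def get(query: String): P = plans.computeIfAbsent(query, loader)
}
```

The cache is unbounded, which is reasonable for a fixed set of dashboard queries but would need an eviction policy if query text varies freely.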
Scaling with More Data
• 15 million rows of NYC Taxi data - **still 700 QPS**!
• This makes sense given how efficient the filtered queries are: most touch a single partition.
Fast Spark Query Stack
Run Spark context on heap with `local[*]`
Load FiloDB-Spark connector, load data in memory
Very fast queries all in process
[Diagram. Labels: Front end app; FiloDB-Spark; SparkContext; InMemoryColumnStore]
Fast Spark Query Stack II
HTTP/REST using Spark Job Server
[Diagram. Labels: JS app; HTTP / REST; Spark Job Server; FiloDB-Spark; SparkContext; InMemoryColumnStore]
Slower: Hive Thrift Server Stack
[Diagram. Labels: BI Client; Hive Thrift Server; Spark; JDBC; FiloDB-Spark; SQLContext; Hive MetaStore]
Your Contributions Welcome!
• http://github.com/tuplejump/FiloDB
