Extending the Yahoo!
Streaming Benchmark
Jamie Grier
@jamiegrier
jamie@data-artisans.com
Who am I?
• Director of Applications Engineering at data
Artisans
• Previously working on streaming computation at
Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for
about a decade
• High-speed video, social media streaming, general
frameworks for stream processing
Overview
• Yahoo! performed a benchmark comparing
Apache Flink, Storm and Spark
• The benchmark never actually pushed Flink to its
throughput limits but stopped at Storm’s limits
• I knew Flink was capable of much more so I
repeated the benchmarks myself
• I did a follow-up blog post explaining my findings
and will summarize them here
Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit current value of window aggregates to
Redis every second for query
• Map ads to campaigns using Redis as well
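The workload above can be sketched in a few lines of plain Python. This is only an illustration, not the benchmark code: the `AD_TO_CAMPAIGN` dict stands in for the Redis lookup, and the function name, event tuples and timestamps are made up for the example.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # 10 second tumbling window, as in the benchmark description

# Hypothetical ad -> campaign mapping; the benchmark keeps this in Redis.
AD_TO_CAMPAIGN = {"ad-1": "camp-A", "ad-2": "camp-A", "ad-3": "camp-B"}


def count_by_campaign_window(events):
    """Count events per (campaign, window start) pair.

    `events` is an iterable of (ad_id, event_time_ms) tuples.
    """
    counts = defaultdict(int)
    for ad_id, event_time_ms in events:
        campaign = AD_TO_CAMPAIGN.get(ad_id)
        if campaign is None:
            continue  # unknown ad: nothing to join against
        # Align the timestamp to the start of its 10 second window.
        window_start = event_time_ms - (event_time_ms % WINDOW_MS)
        counts[(campaign, window_start)] += 1
    return dict(counts)


events = [("ad-1", 1_000), ("ad-2", 9_999), ("ad-3", 5_000), ("ad-1", 10_000)]
result = count_by_campaign_window(events)
```

Here the first two events fall into the same window for `camp-A`, while the last one opens a new window starting at 10,000 ms.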
Any questions so far?
Storm Code
Flink Code
Hardware Specs
• 10 Kafka brokers with 2 partitions each
• 10 compute nodes (Flink / Storm)
• Each machine has 1 Xeon E3-1230-V2@3.30GHz CPU
• 4 cores w/ hyperthreading
• 32 GB RAM (only 8GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and compute nodes
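As a rough back-of-the-envelope check on what the 1 GigE links can carry, the ceiling on message rate is link bandwidth divided by message size. The sketch below assumes every Kafka broker has its own 1 GigE link and ignores protocol overhead. The 250-byte message size is a made-up figure for illustration, not a number from the benchmark.

```python
def max_msgs_per_sec(link_gbit_per_sec, num_links, msg_bytes):
    """Upper bound on messages/sec across `num_links` parallel links."""
    bytes_per_sec = link_gbit_per_sec * 1e9 / 8 * num_links
    return bytes_per_sec / msg_bytes


# 10 brokers on 1 GigE each, with a hypothetical 250-byte message.
ceiling = max_msgs_per_sec(link_gbit_per_sec=1, num_links=10, msg_bytes=250)
print(f"{ceiling:,.0f} msgs/sec")
```

Plugging in a measured message size in place of the assumed one gives a quick estimate of where the network would cap throughput.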
Logical Deployment
[Diagram: Data Generator → Kafka → Stream Processor (Source → Filter → Project → Join → Window → Sink) → Redis. The Join step also reads from Redis.]
Apache Storm
Deployment
[Diagram: Data Generator and three Kafka brokers feeding an Apache Storm pipeline of Source → Filter → Project → Join → Window → Sink, with Shuffle and Redis also shown. Legend: 10 GigE link, 1 GigE link.]
Apache Flink
Deployment
[Diagram, built up over several slides: the operators are merged step by step until Source / Filter / Project / Join form one group and Window / Sink another, with a Shuffle between them. Data Generator, three Kafka brokers and Redis are shown around the Flink pipeline. Legend: 10 GigE link, 1 GigE link.]
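One way to read the diagram: operators that need no repartitioning run back to back in the same task, and only the shuffle moves records between tasks. The following is a toy sketch of that idea in plain Python, not Flink's actual runtime. The record layout, the `"view"` event type, the helper names and the `AD_TO_CAMPAIGN` dict (standing in for the Redis lookup) are all invented for the example.

```python
from zlib import crc32

AD_TO_CAMPAIGN = {"ad-1": "camp-A", "ad-2": "camp-B"}  # hypothetical data


def chained_stage(records):
    """Filter, project and join applied back to back, one record at a time."""
    for rec in records:
        if rec["event_type"] != "view":       # filter
            continue
        ad_id, ts = rec["ad_id"], rec["ts"]   # project
        campaign = AD_TO_CAMPAIGN.get(ad_id)  # join
        if campaign is not None:
            yield campaign, ts


def shuffle(pairs, num_partitions):
    """Route each (campaign, ts) pair to a partition chosen by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for campaign, ts in pairs:
        index = crc32(campaign.encode()) % num_partitions
        partitions[index].append((campaign, ts))
    return partitions


records = [
    {"event_type": "view", "ad_id": "ad-1", "ts": 1},
    {"event_type": "click", "ad_id": "ad-1", "ts": 2},
    {"event_type": "view", "ad_id": "ad-2", "ts": 3},
    {"event_type": "view", "ad_id": "ad-1", "ts": 4},
]
partitions = shuffle(chained_stage(records), num_partitions=4)
```

Because the partition is chosen from the campaign key, every record for a given campaign ends up at the same downstream window task.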
Processing Guarantees
Apples and Oranges
Apache Storm | Apache Flink
At-least-once semantics | Exactly-once semantics
Double counting after failures | No double counting
Lost state after failures | No state loss
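The double-counting difference can be illustrated with a toy model in plain Python. This is not how either system is implemented: the `run` function, its parameters and the checkpoint interval are invented for the example. It counts events, crashes once, rewinds the input to the last checkpoint, and either restores the count along with the input position or leaves the count as it was.

```python
def run(events, fail_after, restore_state):
    """Count events, crash once after `fail_after` events, then replay.

    A checkpoint (count plus input offset) is taken every 2 events.
    On recovery the input is always rewound to the checkpointed offset;
    `restore_state` decides whether the count is rewound with it.
    """
    count = 0
    offset = 0
    checkpoint = (0, 0)  # (count, offset)
    failed = False
    while offset < len(events):
        count += 1
        offset += 1
        if offset % 2 == 0:
            checkpoint = (count, offset)
        if not failed and offset == fail_after:
            failed = True
            saved_count, offset = checkpoint  # rewind the input
            if restore_state:
                count = saved_count           # rewind the state too
    return count


events = list(range(5))
exactly_once = run(events, fail_after=3, restore_state=True)
at_least_once = run(events, fail_after=3, restore_state=False)
```

With the state restored together with the input position, the replayed event is counted once. Without it, the event processed between the checkpoint and the crash is counted twice.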
Benchmark
Baseline
[Bar chart: throughput in msgs/sec for Storm and Flink; axis from 0 to 3,750,000.]
Bottleneck Analysis
Apache Storm
[Diagram, shown over two slides: the Apache Storm deployment (Data Generator, three Kafka brokers, Source → Filter → Project → Join → Window → Sink, Shuffle, Redis), with the label "CPU" added on the second slide. Legend: 10 GigE link, 1 GigE link.]
Bottleneck Analysis
Apache Flink
[Diagram, shown over two slides: the Apache Flink deployment (Data Generator, three Kafka brokers, Source / Filter / Project / Join, Shuffle, Window / Sink, Redis), with the label "Network" added on the second slide. Legend: 10 GigE link, 1 GigE link.]
Eliminate the
Bottleneck
[Diagram, built up over several slides: the Kafka brokers are removed from the Apache Flink deployment, leaving Data Generator, Source / Filter / Project / Join, Shuffle, Window / Sink and Redis. Legend: 10 GigE link, 1 GigE link.]
Apache Flink
Deployment
Round 2
Benchmark
Round 2
10 GigE end-to-end
[Bar chart: throughput in msgs/sec for Storm, Flink and Flink (10 GigE); axis from 0 to 16,000,000.]
Results
• Apache Flink achieved 15 million messages / sec
on the Yahoo! benchmark
• Much stronger processing guarantees: exactly
once
• 80x higher throughput than what was reported in the
original Yahoo! benchmark on similar hardware
Questions?
Storm Compatibility
• Lots of companies already have applications written
using the Storm API
• Flink provides a Storm compatibility layer
• Run your Storm jobs on Flink with a one-line code
change
• Flink also allows you to reuse your existing Storm
spout and bolt code from a Flink job
• Give it a try!
Thanks!
