Stephan Ewen
Flink committer
co-founder @ data Artisans
@StephanEwen
Apache
Flink
1 year of Flink - code
April 2014 April 2015
What is Flink
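[Diagram: the Flink stack]
Libraries and layers on top of the APIs: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Core: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded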
Native workload support
Flink: streaming topologies, long batch pipelines, machine learning at scale, graph analysis
How can an engine natively support all these workloads?
And what does "native" mean?
E.g.: Non-native iterations
Step Step Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
    // Execute MapReduce job
}
E.g.: Non-native streaming
stream discretizer
Job Job Job Job
while (true) {
    // get next few records
    // issue batch job
}
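The loop above can be sketched in plain Scala collections, with no streaming engine involved. This is only an illustration: `runBatchJob` is a hypothetical stand-in for a real batch job, and the fixed batch size is an assumption of the sketch.

```scala
object MicroBatchSketch {
  // Hypothetical stand-in for a batch job: here it only counts the records it receives.
  def runBatchJob(batch: Seq[String]): Int = batch.size

  // Discretize the stream into mini-batches of `batchSize` records
  // and issue one job per mini-batch.
  def discretize(stream: Iterator[String], batchSize: Int): List[Int] =
    stream.grouped(batchSize).map(batch => runBatchJob(batch)).toList

  def main(args: Array[String]): Unit = {
    val records = Iterator("a", "b", "c", "d", "e")
    println(discretize(records, 2)) // prints List(2, 2, 1)
  }
}
```

Each record waits until its mini-batch is full (or the input ends) before any job sees it, which is the latency cost of emulating streaming with batch jobs.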
Native workload support
Flink: streaming topologies, heavy batch jobs, machine learning at scale
How can an engine natively support all these workloads?
And what does "native" mean?
Flink Engine
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some mutable state
4. Operate on managed memory
Program compilation
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) =>
          Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
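To see what the step function computes, here is a minimal sketch of the same transitive closure on plain Scala sets. It does not use Flink: `step` and `transitiveClosure` are names made up for this sketch, and a `Set` stands in for the join, union and distinct on a DataSet.

```scala
object TransitiveClosureSketch {
  case class Path(from: Long, to: Long)

  // One iteration step: extend every known path by one edge (the join),
  // keep the paths found so far (the union); the Set removes duplicates (the distinct).
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from
    } yield Path(path.from, edge.to)
    extended union paths
  }

  // Apply the step a fixed number of times, starting from the edges.
  def transitiveClosure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))

  def main(args: Array[String]): Unit = {
    val edges = Set(Path(1, 2), Path(2, 3), Path(3, 4))
    println(transitiveClosure(edges, 10).size) // prints 6
  }
}
```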
[Diagram: from program to running dataflow]
Pre-flight (Client): type extraction stack, optimizer; Program -> Dataflow Graph
Master: task scheduling, dataflow metadata; deploy operators, track intermediate results
Workers: execute the operators
Example plan: DataSource (orders.tbl) -> Filter -> Map and DataSource (lineitem.tbl), each hash-part [0] -> Join (Hybrid Hash: build HT, probe) -> forward -> GroupRed (sort)
Flink by Use Case
Data Streaming Analysis
streaming dataflows
3 Parts of a Streaming Infrastructure
Gathering Broker Analysis
Sensors
Transaction logs …
Server Logs
Result may be fed back to the broker
Cornerstones of Flink Streaming
• Pipelined stream processor (low latency)
• Expressive APIs
• Flexible operator state, streaming windows
• Efficient fault tolerance for streams and state
Pipelined stream processor
Streaming Shuffle!
Expressive APIs
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
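As a rough illustration of what a 5-second window evaluated every second produces, the following sketch recomputes the word counts on plain Scala collections. It is not the DataStream API: the explicit timestamps, the window boundaries [end - size, end) and the name `windowedCounts` are assumptions made for this sketch.

```scala
object SlidingWindowSketch {
  case class Word(word: String, frequency: Int)

  // lines: (arrival time in seconds, text). Returns one (window end, counts) pair per slide.
  def windowedCounts(
      lines: Seq[(Long, String)],
      sizeSec: Long,
      slideSec: Long): List[(Long, Map[String, Int])] = {
    val words: Seq[(Long, Word)] = for {
      (ts, line) <- lines
      w <- line.split(" ").toList
      if w.nonEmpty
    } yield (ts, Word(w, 1))

    if (words.isEmpty) Nil
    else {
      val lastTs = words.map(_._1).max
      // Evaluate a window every `slideSec` seconds until the last record has been covered.
      val windowEnds =
        Iterator.iterate(slideSec)(_ + slideSec).takeWhile(_ <= lastTs + slideSec).toList
      windowEnds.map { end =>
        // A window ending at `end` holds the records with end - size <= ts < end.
        val inWindow = words.filter { case (ts, _) => ts >= end - sizeSec && ts < end }
        val counts = inWindow
          .groupBy { case (_, w) => w.word }
          .map { case (word, ws) => word -> ws.map(_._2.frequency).sum }
        (end, counts)
      }
    }
  }
}
```

Because the slide is shorter than the window size, a record is counted in several overlapping windows, for example `windowedCounts(Seq((0L, "a b"), (1L, "a"), (6L, "b")), 5L, 1L)`.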
Checkpointing / Recovery
Chandy-Lamport algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriers through the data flow
[Diagram: a barrier travelling through the data stream; operator labels: "Operator checkpoint starting", "checkpoint in progress", "Checkpoint done"]
Before barrier = part of the snapshot
After barrier = not in snapshot (backup till next snapshot)
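The barrier rule can be sketched for a single operator with a single input. This is a deliberately simplified model, not Flink's implementation: the names are invented for the sketch, the operator state is just a running sum, and aligning barriers across several inputs and writing the snapshot asynchronously are left out.

```scala
object BarrierSnapshotSketch {
  sealed trait StreamElement
  case class Record(value: Int) extends StreamElement
  case class Barrier(checkpointId: Long) extends StreamElement

  // A stateful operator (running sum) that snapshots its state when a barrier arrives.
  // Returns the final state and the snapshots taken, keyed by checkpoint id.
  def run(stream: Seq[StreamElement]): (Int, Map[Long, Int]) =
    stream.foldLeft((0, Map.empty[Long, Int])) {
      // Records update the state; they belong to the next snapshot still to come.
      case ((sum, snapshots), Record(v)) => (sum + v, snapshots)
      // A barrier captures the effect of exactly the records that arrived before it.
      case ((sum, snapshots), Barrier(id)) => (sum, snapshots + (id -> sum))
    }

  def main(args: Array[String]): Unit = {
    val stream = Seq(Record(1), Record(2), Barrier(1), Record(3))
    println(run(stream)) // prints (6,Map(1 -> 3))
  }
}
```

The record that follows the barrier changes the live state (6) but not snapshot 1 (3), so it would have to be replayed after a recovery from that snapshot.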
Long batch pipelines
Batch on Streaming
Batch Pipelines
Batch on Streaming
• Batch programs are a special kind of streaming program

Streaming Programs         Batch Programs
Infinite Streams           Finite Streams
Stream Windows             Global View
Pipelined Data Exchange    Pipelined or Blocking Exchange
Batch Pipelines
Data exchange (shuffle / broadcast) is mostly streamed
Some operators block (e.g. sorts / hash tables)
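The difference between a streamed and a blocking operator can be illustrated with Scala's lazy `Iterator`. This is an analogy rather than Flink code: the trace shows that a map handles each record as soon as it is read, while a sort cannot emit anything before it has consumed its whole input.

```scala
import scala.collection.mutable.ListBuffer

object PipelinedVsBlockingSketch {
  // Returns the order in which the operators touch the records.
  def trace(): List[String] = {
    val log = ListBuffer.empty[String]
    val source = Iterator(3, 1, 2).map { x => log += s"read $x"; x }
    // Pipelined operator: processes each record right after it is read.
    val mapped = source.map { x => log += s"map $x"; x * 10 }
    // Blocking operator: the sort needs all of its input before the first output.
    val sorted = mapped.toList.sorted
    sorted.foreach(x => log += s"emit $x")
    log.toList
  }

  def main(args: Array[String]): Unit =
    trace().foreach(println)
}
```

In the trace, reads and maps interleave (read 3, map 3, read 1, map 1, ...), and all emits come only after the last record has been read.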
Operators Execution Overlaps
Memory Management
Smooth out-of-core performance
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
Table API
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = AUTOMOBILE")

val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as("orderId, custId, orderDate, shipPrio")

val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")

val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
Machine Learning Algorithms
Iterative data flows
Iterate by looping
• for/while loop in client submits one job per iteration step
• Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
Iterate in the Dataflow
Example: Matrix Factorization
Factorizing a matrix with 28 billion ratings for recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
Graph Analysis
Stateful Iterations
Iterate natively with state/deltas
Effect of delta iterations…
[Chart: # of elements updated per iteration; x-axis: iteration (1 to 61), y-axis: # of elements updated (0 to 45,000,000)]
… fast graph analysis
More at: http://data-artisans.com/data-analysis-with-flink.html
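The idea behind stateful (delta) iterations can be sketched with connected components on plain Scala collections. This is not Flink's delta iteration API: `solution` plays the role of the state, `workset` holds only the vertices changed in the previous step, and the function name and the min-label algorithm are choices made for this sketch. It assumes every edge endpoint is listed in `vertices`.

```scala
object DeltaIterationSketch {
  // Connected components by propagating the smallest vertex id over an undirected graph.
  // Returns the final labels and the number of vertices updated in each iteration.
  def connectedComponents(
      vertices: Set[Int],
      edges: Set[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    val neighbors: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    var solution: Map[Int, Int] = vertices.map(v => v -> v).toMap // state: vertex -> component id
    var workset: Map[Int, Int] = solution                         // vertices changed in the last step
    var updatesPerIteration = List.empty[Int]

    while (workset.nonEmpty) {
      // Only vertices that changed send their label to their neighbours.
      val candidates = for {
        (v, label) <- workset.toSeq
        n <- neighbors.getOrElse(v, Set.empty[Int]).toSeq
      } yield (n, label)
      // The delta holds the candidates that improve on the current state.
      val delta = candidates
        .groupBy(_._1)
        .map { case (n, cs) => n -> cs.map(_._2).min }
        .filter { case (n, label) => label < solution(n) }
      solution = solution ++ delta
      workset = delta
      updatesPerIteration = updatesPerIteration :+ delta.size
    }
    (solution, updatesPerIteration)
  }
}
```

For vertices 1 to 5 with edges (1,2), (2,3), (4,5), the update counts per iteration are 3, 1, 0: the work per step shrinks as the state converges, which is the effect the chart shows at scale.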
Closing
Flink Roadmap for 2015
Some highlights that we are working on
• More flexible state and state backends in streaming
• Master Failover
• Improved monitoring
• Integration with other Apache projects
  - SAMOA, Zeppelin, Ignite
• More additions to the libraries
flink.apache.org
@ApacheFlink

Apache Flink@ Strata & Hadoop World London