How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark
© Cloudera, Inc. All rights reserved.
Introduction
● Guru Medasani
○ Data Science Architect at Domino Data Lab
○ Previously Senior Solutions Architect at Cloudera
● Jordan Hambleton - Consulting Manager in San Francisco
○ Nearly 4 years as Resident Senior Architect at a large technology firm
○ Previously software engineer building operational data systems on CDH
Agenda
● Intro
● Overview of Spark Streaming from Kafka
○ Workflow of the DStream and RDD
○ Spark Streaming Kafka consumer types
● Offset management
○ Motivation
○ Storing offsets in external data stores
● Q & A
Overview
[Diagram: Topic A is split into partitions (partition1 ... partitionn) spread across the servers of a Kafka cluster, each partition with its own run of offsets (for example 121, 122, 123). Executors (executor1 ... executorn) on a Hadoop / YARN cluster read from them; the diagram is annotated "more parallelism".]
Overview of Spark Streaming from Kafka
● DStream - sequence of RDDs
● Two approaches in KafkaUtils
○ Receiver based
○ Direct approach (recommended & the method we talk about)
● Spark Streaming embeds a Kafka client
○ Spark 1.6 uses the 0.9.0-kafka-2.0.0 client (SimpleConsumer)
○ Spark 2.x with spark-streaming-kafka-0-8 uses the 0.9.0-kafka-2.0.2 client (SimpleConsumer)
○ Spark 2.x with spark-streaming-kafka-0-10 uses the 0.10.0-kafka-2.1.0 client (KafkaConsumer)
DStream and RDD Workflow
● Spark Streaming
○ batchIntervalInSeconds
○ stopGracefullyOnShutdown
● Kafka
○ bootstrap.servers
○ auto.offset.reset
○ group.id
○ key.deserializer
○ value.deserializer
Spark Streaming Kafka Consumer # 1
● spark-streaming-kafka-0-8 / 0.9.0-kafka-2.0.2
● DStream
○ Gets the offset range of each topic/partition - throttle with maxRatePerPartition
○ auto.offset.reset (smallest|largest)
○ refresh.leader.backoff.ms - applies when a partition leader is lost
● KafkaRDD for a set of topic, partition, offsets
○ User can now get offset ranges from the RDD
■ topic, partition, fromOffset (inclusive), untilOffset (exclusive)
● KafkaRDDPartition iterator
○ SimpleConsumer initialized and batches of events fetched
○ refresh.leader.backoff.ms - applies when a partition leader is lost
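As a rough illustration of the offset ranges on this slide, the Python sketch below models one range per topic/partition and the effect of a per-partition rate limit. It is a stand-in, not Spark's API: `OffsetRange` and `clamp_range` are hypothetical names, and the rule that the cap equals rate times batch length is an assumption that ignores backpressure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical stand-in for the offset range exposed by a KafkaRDD."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def count(self) -> int:
        # Number of messages covered by the half-open range [from, until).
        return self.until_offset - self.from_offset


def clamp_range(topic, partition, from_offset, latest_offset,
                max_rate_per_partition, batch_seconds):
    """Cap the batch at max_rate_per_partition messages per second (assumed rule)."""
    limit = from_offset + max_rate_per_partition * batch_seconds
    return OffsetRange(topic, partition, from_offset, min(latest_offset, limit))


# Plenty of backlog: the rate limit decides where the batch ends.
throttled = clamp_range("topic_a", 0, 121, 10_000, 100, 5)
# Little backlog: the batch ends at the latest available offset.
caught_up = clamp_range("topic_a", 0, 121, 300, 100, 5)
print(throttled, throttled.count())
print(caught_up, caught_up.count())
```

Because the range is half-open, the `until_offset` of one batch can be used directly as the `from_offset` of the next without skipping or repeating a message.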
Spark Streaming Kafka Consumer # 2
● Supported - spark-streaming-kafka-0-10 / 0.10.0-kafka-2.1.0
● Internal Kafka client uses the new Java KafkaConsumer
● ConsumerStrategies
○ Subscribe, Assign, SubscribePattern
● LocationStrategies
○ executor distribution strategy (PreferConsistent, PreferFixed, PreferBrokers)
● DStream
○ Gets the offset range of each topic/partition - throttle with maxRatePerPartition
○ auto.offset.reset (earliest|latest)
○ Be careful - enable.auto.commit (default true)
○ heartbeat & session timeouts
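To make the consumer settings on this slide concrete, here is a minimal Python sketch of a parameter map plus a small sanity check. The broker addresses, group name, and the `check_params` helper are hypothetical; only the property names and the deserializer class come from Kafka itself.

```python
# Hypothetical consumer settings; broker addresses and group name are placeholders.
kafka_params = {
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "example_group",
    "key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    # Where to start when the group has no committed offset.
    "auto.offset.reset": "latest",
    # Off, so offsets are stored only when the application decides to.
    "enable.auto.commit": "false",
}


def check_params(params):
    """Return a list of problems found in the settings (empty when they look fine)."""
    problems = []
    if params.get("auto.offset.reset") not in ("earliest", "latest", "none"):
        problems.append("auto.offset.reset must be earliest, latest or none")
    # The Kafka consumer default is true, so a missing key counts as enabled.
    if str(params.get("enable.auto.commit", "true")).lower() != "false":
        problems.append("enable.auto.commit is on")
    return problems


print(check_params(kafka_params))
```

The check flags the old `smallest`/`largest` values as well, since those belong to the 0-8 consumer rather than the new KafkaConsumer.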
Spark Streaming Kafka Consumer # 2 (continued)
● DStream
○ Consumer poll for group coordination & discovery
○ Identify new partitions and their from offsets
○ Pause consumer
○ seekToEnd to get untilOffsets
● KafkaRDD
○ Executor settings are fixed: enable.auto.commit = false, auto.offset.reset = none, group.id = spark-executor-${group.id}
○ Attempts to assign offset ranges to executors consistently for optimal consumer caching
● KafkaRDDPartition iterator
○ Initialize/look up CachedKafkaConsumer with the executor group
■ one consumer per topic/partition, with an internal buffer
■ on cache miss, seek and poll
Keeping Track
Motivation for Tracking Offsets
● Planned Maintenance
○ Upgrades
○ Bug-fixes
● Unplanned Maintenance
○ Failures
● Application Processing Errors
○ Wrong calculations
○ Updated algorithm over known streaming data
● More control over messages
○ Earliest and latest alone are insufficient
Obtaining Offsets
● Cast the RDD to HasOffsetRanges
● Do the cast in the DStream's first transformation, before any other method is called on the stream
Offset management Workflow
● Limited options prior to spark-streaming-kafka-0-10
● Store offsets in external datastore
○ Checkpoints (Not recommended)
○ ZooKeeper
○ Kafka
○ HBase
● Do not have to manage offsets
Offset Management in ZooKeeper
● ZooKeeper
○ znode - /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
○ Only retains the latest committed offsets
○ Can easily be managed by external tools
○ Can leverage existing lag monitoring, but gives no historical insight
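The znode layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `offset_znode` and `InMemoryOffsetStore` are hypothetical names, and a plain dictionary stands in for ZooKeeper so the example runs without a cluster.

```python
def offset_znode(group_id: str, topic: str, partition_id: int) -> str:
    """Build the znode path that holds one partition's offset."""
    return f"/consumers/{group_id}/offsets/{topic}/{partition_id}"


class InMemoryOffsetStore:
    """Dict-backed stand-in for ZooKeeper: one value per path, overwritten on commit."""

    def __init__(self):
        self._nodes = {}

    def commit(self, group_id, topic, partition_id, offset):
        self._nodes[offset_znode(group_id, topic, partition_id)] = offset

    def read(self, group_id, topic, partition_id):
        # Returns None when nothing has been committed for this partition.
        return self._nodes.get(offset_znode(group_id, topic, partition_id))


store = InMemoryOffsetStore()
store.commit("csi_group", "device_alerts", 0, 121)
store.commit("csi_group", "device_alerts", 0, 144)  # replaces 121; no history is kept
print(offset_znode("csi_group", "device_alerts", 0),
      store.read("csi_group", "device_alerts", 0))
```

The second commit overwriting the first is the point of the "only retains the latest committed offsets" bullet: there is one value per path and nothing to look back on.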
Offset Management in Kafka
● Kafka
○ CanCommitOffsets provides an async commit to Kafka's internal offsets topic
○ The internal Kafka topic is more difficult to manage manually
○ Can leverage existing lag monitoring, but gives no historical insight
Offset Management in HBase
● HBase
○ Unique entry per consumer group, batch
● Fine-grained monitoring over time
● HBase shell for easy management
● Get latest entry -
○ scan 'prod_stream', {STARTROW => 'device_alerts:csi_group', REVERSED => true, LIMIT => 1}
schema:
row: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
column family: offsets
qualifier: <PARTITION_ID>
value: <OFFSET_ID>
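The schema above lends itself to a small Python sketch of how the newest entry for a consumer group could be found. Everything here is a stand-in: `row_key` and `InMemoryOffsetTable` are hypothetical names, a sorted dictionary replaces HBase, and the start key appends the current time (an assumption of this sketch) so that it sorts at or after the newest entry.

```python
import bisect


def row_key(topic: str, group_id: str, batch_time_ms: int) -> str:
    """Row key following the schema: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>."""
    return f"{topic}:{group_id}:{batch_time_ms}"


class InMemoryOffsetTable:
    """Sorted-key stand-in for an HBase table with a single 'offsets' column family."""

    def __init__(self):
        self._rows = {}  # row key -> {partition id: offset}

    def put(self, topic, group_id, batch_time_ms, offsets):
        self._rows[row_key(topic, group_id, batch_time_ms)] = dict(offsets)

    def latest(self, topic, group_id, now_ms):
        """Emulate a reversed scan with LIMIT 1 starting at <topic>:<group>:<now_ms>."""
        keys = sorted(self._rows)
        start = row_key(topic, group_id, now_ms)
        prefix = f"{topic}:{group_id}:"
        # Index of the last key that sorts at or before the start key.
        # String order matches time order only while timestamps have equal digit counts.
        i = bisect.bisect_right(keys, start) - 1
        if i >= 0 and keys[i].startswith(prefix):
            return keys[i], self._rows[keys[i]]
        return None


table = InMemoryOffsetTable()
table.put("device_alerts", "csi_group", 1500000000000, {0: 121, 1: 137})
table.put("device_alerts", "csi_group", 1500000005000, {0: 144, 1: 139})
table.put("device_alerts", "other_group", 1500000009000, {0: 7})
newest = table.latest("device_alerts", "csi_group", 1500000010000)
print(newest)
```

Keeping one row per batch is what gives the history that the ZooKeeper and Kafka options lack; the price is that the reader has to pick the newest row itself.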
Starting Streaming Jobs with Known Offsets
● Spark Streaming job started for the first time
● No changes in Kafka partitions
● Increase in number of Kafka partitions
http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
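The three start-up cases listed on this slide can be folded into one lookup, sketched below in Python. The `starting_offsets` helper is hypothetical, and starting unseen partitions at offset 0 is an assumption; it only works while offset 0 is still retained by Kafka, otherwise the earliest available offset would be needed.

```python
def starting_offsets(kafka_partitions, saved_offsets):
    """Pick a starting offset for every partition currently in the topic.

    kafka_partitions: iterable of partition ids currently reported for the topic.
    saved_offsets: {partition id: offset} from the external store, empty on first run.
    """
    # Saved partitions resume where they stopped; unseen partitions start at 0 (assumed).
    return {p: saved_offsets.get(p, 0) for p in sorted(kafka_partitions)}


# Case 1: first start, nothing saved yet.
first_run = starting_offsets([0, 1, 2], {})
# Case 2: restart with the same partitions.
unchanged = starting_offsets([0, 1, 2], {0: 144, 1: 123, 2: 139})
# Case 3: restart after a partition was added to the topic.
grown = starting_offsets([0, 1, 2, 3], {0: 144, 1: 123, 2: 139})
print(first_run, unchanged, grown, sep="\n")
```

The resulting map is the kind of per-partition starting position that would be handed to the direct stream when the job is created.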
Questions?
Thank you
Jordan Hambleton
Guru Medasani
