Data Pipelines With StreamSets
Jowanza Joseph
@jowanza
Agenda
About me
The Problem Space
Streaming
StreamSets
Demo
Questions
About Me
Software Engineer at One Click Retail
Scala / Spark / Mesos / Kubernetes
Author: Apache Spark Fieldbook
Cyclist
Husband and father
Retail Intelligence
Data Size
Real-Time
Operational Complexity
Batch Processing
What Are Data Pipelines?
What Problems Do They Solve?
Scalability
Complexity
Observability
Extensibility
Lambda Architecture
Kappa Architecture
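A minimal sketch of the Kappa idea, assuming a hypothetical in-memory list stands in for the durable event log (for example a Kafka topic) and that events are (product_id, units_sold) pairs. Both the names and the event shape are illustrative and not from the talk. The point is that one code path serves live processing and reprocessing, where Lambda keeps separate batch and speed layers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (product_id, units_sold).
Event = Tuple[str, int]


def fold_events(events: Iterable[Event]) -> Dict[str, int]:
    """Single processing path: aggregate units sold per product."""
    totals: Dict[str, int] = defaultdict(int)
    for product_id, units in events:
        totals[product_id] += units
    return dict(totals)


# Stand-in for an append-only, replayable log.
log: List[Event] = [("a", 2), ("b", 1), ("a", 3)]

# Live view: events folded as they arrive.
live_view = fold_events(log)

# Reprocessing: replay the same log from offset 0 through the same code.
replayed_view = fold_events(log[0:])
```

Because the view is derived only from the log, fixing a bug in `fold_events` means replaying the log rather than maintaining a second batch codebase.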
Goals
Data Provenance
Guaranteed Delivery
Configurable
Extensible
Multi-Protocol Support
DAG
Distributed
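The DAG goal can be sketched as follows: stages are nodes, each stage lists the stages it reads from, and a topological sort gives a valid execution order. The stage names are hypothetical and this is not how StreamSets represents pipelines internally. It only illustrates the graph idea using Python's standard `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each stage maps to the set of upstream stages it reads from.
pipeline: Dict[str, Set[str]] = {
    "kafka_origin": set(),
    "parse_json": {"kafka_origin"},
    "filter_invalid": {"parse_json"},
    "hdfs_destination": {"filter_invalid"},
    "alerts_destination": {"filter_invalid"},
}


def execution_order(stages: Dict[str, Set[str]]) -> List[str]:
    """Return the stages so that every stage appears after its upstream stages.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(stages).static_order())


order = execution_order(pipeline)
```

One origin fanning out to two destinations is still a DAG. A stage that fed back into its own upstream would be rejected with a `CycleError`.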
Based on Streams
Architecture
Running on Mesos
Analytics Data
Real-Time Data
Our Use Case
Demo
