Data Engg

The document outlines the components of a modern data stack, detailing layers such as Data Sources, Ingestion, Storage, Processing, Data Warehouse, Analytics, Orchestration, Observability, and Governance. It provides examples of tools and technologies used in each layer, along with analogies to restaurant operations to simplify understanding. Additionally, it compares practices in Application Security (AppSec) with Data Engineering, highlighting equivalent terms and techniques.


Modern Data Stack – Simplified Table

Layer | Purpose | Examples
Data Sources | Where data originates from | Databases (PostgreSQL, MySQL), APIs, SaaS tools (Salesforce)
Ingestion | Collecting and importing data | Fivetran, Stitch, Kafka, Flume
Storage / Data Lake | Storing raw or semi-processed data | AWS S3, Azure Data Lake, Google Cloud Storage
Processing / ETL/ELT | Cleaning, transforming, and preparing data | dbt, Apache Spark, Airflow, Talend, AWS Glue
Data Warehouse | Structured storage for analytics | Snowflake, BigQuery, Redshift, Azure Synapse
Analytics / BI Layer | Generating reports and dashboards | Tableau, Power BI, Looker, Superset
Orchestration | Managing pipeline workflows and dependencies | Apache Airflow, Prefect, Dagster
Observability / Monitoring | Ensuring data quality and pipeline health | Monte Carlo, Databand, Great Expectations
Governance & Security | Controlling access, compliance, and lineage | Collibra, Alation, Apache Atlas, Immuta
Data engineering involves designing and managing data pipelines and ETL processes, and ensuring clean, reliable data for analytics and decision-making.
Typical pain points include data silos, pipeline failures, and scaling to handle real-time data; cloud data platforms, data lakes, and orchestration tools (like Airflow or dbt) play a central role in solving them.
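
To make the orchestration layer concrete, here is a minimal Airflow DAG sketch, assuming Airflow 2.4+ is installed; the `orders_pipeline` name and the extract/transform callables are illustrative stand-ins, not part of any real pipeline:

```python
# A minimal orchestration sketch, assuming Airflow 2.4+.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    # Placeholder: pull raw rows from a source system (API, DB, etc.)
    print("extracting raw orders...")


def transform_orders():
    # Placeholder: clean and reshape the extracted data
    print("transforming orders...")


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)

    extract >> transform  # transform only runs after extract succeeds
```

Once the file sits in the DAGs folder, the scheduler runs extract, then transform, every day, with logging, retries, and dependency handling done by Airflow rather than by hand-rolled cron scripts.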
Restaurant Equivalent | Explanation
Ingredients arriving at the kitchen | Raw data coming from different sources (sales, CRM, sensors, etc.)
Unloading and storing ingredients in the pantry or fridge | Tools that bring data into storage, like Fivetran or Kafka
Pantry where ingredients are kept | Central place to store raw data (S3, Azure Data Lake)
Chopping, cleaning, cooking the food | Data is cleaned, transformed, and made ready (e.g., with Spark or dbt)
Plating the food and placing it on the serving counter | Structured, ready-to-serve data (Snowflake, Redshift, BigQuery)
Waiter delivering the dish to the customer | Dashboards and reports (Power BI, Tableau) serve data insights to business
Kitchen manager coordinating all chefs and timings | Manages workflows and pipeline schedules (like Airflow)
Food safety checks, quality control | Ensures pipelines and data are accurate and reliable (e.g., Great Expectations)
Who has access to the kitchen and food storage | Ensures only authorized people access the right data (Collibra, Immuta)
Table: Popular Apache Projects (With Simple Descriptions)

Project Name | Category | What It Does (Simple Analogy)
Apache Kafka | Data Streaming | Like a high-speed courier for data – sends messages between apps
Apache Hadoop | Big Data Processing | Like a warehouse – stores and processes huge data files
Apache Spark | Fast Data Processing | Like a microwave – processes large data super-fast
Apache Hive | SQL for Big Data | Lets you ask questions (like SQL) on big data stored in Hadoop
Apache Flink | Real-time Stream Processing | Like Spark, but made for real-time flowing data
Apache Airflow | Workflow Automation | Like a task manager – schedules and automates data jobs
Apache NiFi | Data Flow Automation | Like a traffic controller – routes data between systems
Apache Beam | Data Pipeline API | A programming model to run data pipelines on engines like Spark, Flink
Apache Cassandra | NoSQL Database | Like a notebook – stores lots of fast-changing data (used by Instagram)
Apache HBase | Big Data DB | A database built on Hadoop – stores billions of records
Apache Solr | Search Engine | Like Google – helps apps search through massive content quickly
Apache Lucene | Search Library | The brains behind Solr – does the actual indexing and searching
Apache ZooKeeper | Coordination Tool | Like a project manager – keeps systems in sync
Apache HTTP Server | Web Server | Serves websites – one of the most used web servers in the world
Apache Camel | Integration Framework | Connects different apps and systems like a universal translator
Apache Pulsar | Event Streaming | Like Kafka – handles real-time messages, but with built-in storage
Apache Storm | Real-time Stream Processing | Real-time processing of streaming data
Apache Druid | Real-time Analytics | Real-time analytics on time-series data
Apache Zeppelin | Notebook | Web-based notebook for data analytics
Apache Parquet / ORC | Columnar Storage Format | Columnar data storage formats for efficient analytics
Apache Avro | Serialization Format | Row-based data format (serialization)
Apache Oozie | Workflow Scheduler | Workflow scheduler for Hadoop jobs

Vendors and Cloud Equivalents

Project | Commercial Vendors / Distributions | AWS Equivalent | Azure Equivalent | GCP Equivalent
Kafka | Confluent, Aiven, Redpanda (Kafka-compatible) | Amazon MSK (Managed Kafka) | Azure Event Hubs (Kafka-compatible) | Pub/Sub (Kafka-like), Confluent Cloud
Hadoop | Cloudera, Hortonworks (merged), Databricks | Amazon S3 / EFS / FSx + Amazon EMR | Azure Data Lake Storage (Gen2) / HDInsight | Cloud Storage / BigLake + Dataproc
Spark | Databricks, Cloudera, Microsoft, Qubole, Amazon EMR | Amazon EMR with Spark / AWS Glue | Azure Synapse / HDInsight with Spark | Dataproc / Dataflow
Hive | Amazon Athena (Hive-compatible) | Amazon Athena / EMR Hive | Azure Synapse / HDInsight Hive | BigQuery / Dataproc Hive
Flink | Ververica, Alibaba, AWS | Kinesis Data Analytics | Azure Stream Analytics (some overlap) | Dataflow (Apache Beam)
Airflow | Astronomer, Google Cloud Composer, Amazon MWAA | Amazon MWAA | Azure Data Factory (limited parity) | Cloud Composer
NiFi | Cloudera NiFi, Hortonworks DataFlow | AWS Glue workflows / Step Functions | Azure Data Factory | Cloud Data Fusion
Beam | Google Cloud Dataflow (Beam-powered), Payara | Kinesis Data Analytics (Beam SDK) | Custom (Airflow / Data Factory) | Cloud Dataflow (native support)
Cassandra | DataStax, Instaclustr, Alibaba Cloud | Amazon Keyspaces (Cassandra-compatible) | Cosmos DB (Cassandra API) | Astra DB (via Marketplace)
HBase | Cloudera | HBase on EMR / DynamoDB | HBase on HDInsight | Bigtable
Solr | Lucidworks, OpenSearch (Solr competitor), Cloudera | OpenSearch (AWS fork) | Azure Cognitive Search | Elasticsearch via GCP Marketplace
Lucene | Used inside Solr, Elasticsearch, Lucidworks | — | — | —
ZooKeeper | Built into older Kafka, Hadoop, Solr stacks; Red Hat, Cloudera | — | — | —
HTTP Server | Bitnami, cloud hosting providers | — | — | —
Camel | Red Hat Fuse, Talend ESB, WSO2 | Amazon EventBridge / Step Functions | Azure Logic Apps / Service Bus | Cloud Functions + Eventarc
Pulsar | StreamNative, DataStax (Astra Streaming), Splunk | — | — | —
Storm | — | Kinesis Data Analytics | Azure Stream Analytics | Dataflow (streaming)
Druid | — | Amazon Timestream (partial parity) | Azure Data Explorer / Synapse | BigQuery (with partitioned tables)
Zeppelin | — | EMR Notebooks / SageMaker Studio | Synapse Notebooks | Vertex AI Workbench / Colab
Parquet / ORC | — | Used in Glue, Athena, Redshift Spectrum | Used in Synapse, Data Lake | Used in BigQuery, Dataproc
Avro | — | Used in Glue / EMR / Kafka-compatible formats | Used in Data Factory / Data Lake | Used in Pub/Sub, Dataflow
Oozie | — | AWS Step Functions | Azure Data Factory pipelines | Cloud Composer
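
To ground the "high-speed courier" analogy in code, here is a minimal producer sketch using the kafka-python client; it assumes a broker is reachable at localhost:9092, and the `trip-updates` topic name is hypothetical:

```python
import json

from kafka import KafkaProducer

# Connect to a local broker (assumption: Kafka is running on localhost:9092).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each message is one event; Kafka is the courier between producing
# and consuming applications.
producer.send("trip-updates", {"trip_id": 42, "status": "rider_onboard"})
producer.flush()  # block until the broker has acknowledged the message
```

A consumer on the other side subscribes to the same topic and receives events in order within each partition, which is what makes Kafka usable as an event backbone rather than a simple queue.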
AppSec vs. Data Engineering – Similar Terms & Techniques

AppSec Practice | Equivalent in Data Engineering | Layman Analogy
SAST (Static Code Analysis) | Data Pipeline Linting / DAG Validation | Checking if your pipeline "blueprint" is correct before building
Custom Rules (in SAST/DAST) | Data Quality Rules (Great Expectations, dbt tests) | Ensuring data looks and behaves as expected
Taint Analysis | Lineage & Impact Analysis (OpenLineage, Amundsen) | Tracing how "bad data" flows through the system
SCA (Dependency Analysis) | Dependency Management in Data Pipelines (dbt packages, libraries) | Checking which other tools or code your pipeline depends on
Reachability Analysis (in SCA) | Column-Level Lineage + Usage Stats | Checking if that "risky column" is ever used downstream
SBOM (Software Bill of Materials) | Data Catalogs + Metadata Lineage (e.g., Atlan, Amundsen, Unity Catalog) | Inventory of all datasets, fields, owners, and their sources
VEX (Vulnerability Exploitability Exchange) | Data Contract Validations + Quality Exception Alerts | Knowing when bad data can actually hurt you
SLSA (Supply Chain Levels for Software Artifacts) | Data Supply Chain Assurance (dbt CI/CD + validations + approvals) | Confidence that your data is built correctly at every step
Supply Chain Security (e.g., Sigstore) | Data Provenance / Pipeline Signing (e.g., data fingerprinting, audit logs) | Verifying no tampering in the data's journey
DAST (Dynamic App Security Testing) | Runtime Data Validations / Real-time Anomaly Detection (Monte Carlo, Databand) | Watching data as it flows and catching issues live
Penetration Testing | Synthetic Data Injection / Chaos Testing in Pipelines | Sending test data to break things and see how systems react
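
The custom-rules row above maps naturally onto declarative data-quality checks. A minimal sketch using Great Expectations' classic pandas interface (pre-1.0 versions of the library; the column names and sample data are made up):

```python
import great_expectations as ge
import pandas as pd

# Toy data with two deliberate rule violations: a null user_id and a negative amount.
df = pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, -5.0, 3.2]})
gdf = ge.from_pandas(df)

# Declare expectations the way SAST custom rules declare code policies.
null_check = gdf.expect_column_values_to_not_be_null("user_id")
range_check = gdf.expect_column_values_to_be_between("amount", min_value=0)

print(null_check.success, range_check.success)  # both False for this sample
```

The same expectations can run inside a pipeline as a gate, failing the DAG run instead of letting bad data flow downstream.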
AppSec Practice | Similar Data Engineering Practice | Explanation / Analogy
Custom Rules in SAST | Custom Data Quality Rules / Validation Checks | Custom rules flag insecure code; custom data rules flag bad/invalid data
Taint Analysis in SAST | Data Lineage & Contamination Checks | Taint analysis traces untrusted input to sensitive code; data lineage traces untrusted data through the pipeline
Dependency & Reachability Analysis | Pipeline Dependency Mapping & Failure Impact Analysis | Determines how pipeline components depend on each other and how failure propagates
SBOM (SCA) | Data Catalog / Data Asset Inventory | SBOM lists software components; a data catalog lists datasets, lineage, owners – critical for compliance
VEX | Data Quality Exceptions & Risk-Based Access Logs | Data systems flag exceptions based on sensitivity and business risk
SLSA (Supply-chain Levels for Software Artifacts) | Data Supply Chain Integrity Checks (Provenance, ETL Attestations) | AppSec secures the build pipeline; data engineers secure ETL jobs with provenance and attestations
Supply Chain Security | Secure Data Pipelines (Encryption, IAM, Secrets Mgmt) | Data pipelines are secured with encryption, secrets rotation, and IAM
DAST | Synthetic Data Testing & Real-Time Data Monitoring | Like DAST probing apps at runtime, data teams test pipelines with synthetic data and real-time anomaly detection
Penetration Testing | Data Access Audits & Simulated Breach Exercises | Simulate unauthorized access to sensitive datasets
Container Security | Dockerized ETL Job Hardening / Image Scanning | Both scan and secure containers used to run apps or data pipelines
API Security | Secured Data APIs / Throttling / AuthN-Z | Ensure data APIs are rate-limited, authenticated, and do not expose sensitive data
Test Management (QA in AppSec) | Data Pipeline Test Coverage (Unit, Integration, Regression Tests) | Ensures data transformations and business logic are properly tested
Functional Testing | Schema Validation & Business Rule Testing | Verifies correct structure and business meaning of data, just like app testing verifies app behavior
Performance Testing | Throughput, Latency & SLA Monitoring in Pipelines | Ensures data pipelines meet performance goals under load, similar to app performance testing
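
The test-coverage rows above translate directly into plain pytest tests over transformation logic; `normalize_amounts` below is a hypothetical ETL step written for illustration:

```python
import pandas as pd


def normalize_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Drop refunds (negative amounts) and convert cents to dollars."""
    out = df[df["amount_cents"] >= 0].copy()
    out["amount_usd"] = out["amount_cents"] / 100
    return out


def test_normalize_amounts_drops_refunds_and_converts():
    raw = pd.DataFrame({"amount_cents": [250, -100, 0]})
    result = normalize_amounts(raw)
    assert list(result["amount_usd"]) == [2.5, 0.0]   # refund row removed
    assert (result["amount_cents"] >= 0).all()        # no negatives remain
```

Run with `pytest` in the same directory; regression coverage grows by adding a case for each bug found in production.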

AppSec Security Practice | Similar/Equivalent in Data Engineering
Custom Rules (in SAST tools) | Custom data validation rules (e.g., schema enforcement, data quality checks via Great Expectations or Deequ)
Taint Analysis in SAST (track flow of untrusted input) | Data lineage and provenance tracking (e.g., using OpenLineage, Apache Atlas) to trace input source and transformations
Dependency and Reachability Analysis in SCA | Data pipeline dependency analysis and DAG evaluation (e.g., Airflow, dbt) to check for unused/critical transformations
SBOM and VEX | Data BOM / metadata cataloging (e.g., using DataHub or Amundsen) with data risk annotation and sensitivity tagging
SLSA (Supply-chain Levels for Software Artifacts) | Data pipeline security posture (e.g., CI/CD for ETL jobs, signed data artifacts, pipeline execution attestation)
Supply Chain Security (end-to-end integrity of code and infra) | Secure data pipeline lifecycle: signed data snapshots, access-controlled storage, encryption at rest/in-transit
DAST | Runtime validation of data output (e.g., data quality checks, anomaly detection, and output validation on BI dashboards)
Penetration Testing | Data pipeline chaos engineering / fault injection (e.g., simulate malicious input or broken sources to test resilience)
Containers (image scanning, runtime security) | Containerized data services with secure images (e.g., scanning Airflow/Spark containers with Trivy or Clair)
API Security (auth, rate-limiting, data leakage prevention) | Securing data APIs and access layers (e.g., enforcing authZ/authN on Presto/Trino endpoints, query firewalls)
Test Management | Data test management platforms (e.g., Soda Cloud, Monte Carlo, Datafold test suite management for pipelines)
Functional Testing | Unit and integration tests for ETL jobs and transformations (e.g., using pytest, dbt tests)
Performance Testing | Load testing for queries and data pipelines (e.g., simulating concurrent access on warehouses or testing latency in data pipelines)
Similar Practice in Data Engineering | Explanation / Analogy
Custom Data Validation Rules / Data Quality Checks | Enforce custom checks on code vs. custom checks on data quality and schema validity.
Data Lineage & Contamination Tracking | Trace untrusted code input vs. trace untrusted or low-quality data through pipelines.
DAG Dependency Mapping & Failure Impact Analysis | Analyze software dependencies vs. pipeline component dependencies and data impact assessment.
Data Catalog & Metadata Management | Software inventory vs. complete dataset inventory with lineage and ownership.
Risk-Based Exception Reporting & Access Logs | Highlight exploitable vulnerabilities vs. alert on high-risk data conditions or exceptions.
Data Pipeline Provenance & Attestation | Secure software build chains vs. secure and auditable data transformation pipelines.
Secure ETL Jobs (Encryption, Secrets, IAM, Auditing) | Secure software supply chain vs. secure data processing pipelines.
Real-Time Data Anomaly Detection / Synthetic Testing | Runtime app testing vs. runtime pipeline testing using synthetic data or anomaly detection.
Data Access Audits / Breach Simulation | Simulated attacks on apps vs. simulated unauthorized access to data systems.
Containerized Job Image Scanning & Runtime Security | Secure containers in apps vs. secure containers in data processing (Spark, Airflow, etc.).
Secure Data APIs (AuthN/Z, Throttling, Rate Limits) | Protect APIs serving app features vs. APIs serving data.
ETL Unit Testing & Regression Testing | App test case management vs. data pipeline test coverage and regression control.
Schema Checks & Business Rule Validation | App functionality testing vs. ensuring data transformations preserve business meaning.
Pipeline Throughput, Latency, SLA Monitoring | App stress/load testing vs. measuring ETL job performance under high data volume.
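
The provenance and attestation rows can be approximated even without dedicated tooling. Below is a hand-rolled sketch of data fingerprinting plus an append-only audit log, using only the Python standard library; production setups would use signed attestations (e.g., Sigstore) instead:

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint_file(path: str) -> str:
    """SHA-256 of the file contents; any tampering changes the digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_record(path: str, step: str) -> str:
    """One JSON line describing a step in the data's journey."""
    return json.dumps({
        "artifact": path,
        "step": step,
        "sha256": fingerprint_file(path),
        "at": datetime.now(timezone.utc).isoformat(),
    })

# Example: append audit_record(...) to a log after each pipeline stage,
# then compare digests across stages to verify nothing was altered between them.
```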

CloudSec vs. Data Engineering: Technique Mapping Table

CloudSec Practice / Tool | Similar/Applied Technique in Data Engineering
CSPM (Cloud Security Posture Management) | Data Infra Posture Management (via Infra-as-Code)
CWPP (Cloud Workload Protection Platform) | Data Workload Protection (runtime/data services security)
CIEM (Cloud Infrastructure Entitlement Management) | Data Access Governance / RBAC Analysis
Threat Intelligence (Cloud-based) | Data Threat Modeling and Anomaly Detection
Policy-as-Code (OPA, Rego) | Access/Data Quality Policies as Code
Pipeline Scanning (IaC security scanning) | Scanning Data Pipelines for Risky Patterns (e.g., secrets, PII leaks)
Workload Protection (runtime behavior monitoring) | Real-time Monitoring of ETL Job Behavior / Data API Access
Policy Enforcement (e.g., deny risky deployments) | Data Platform Governance (deny unsafe schema or access changes)
Compliance Rules (e.g., HIPAA, GDPR via cloud config analysis) | Data Compliance (e.g., GDPR, CCPA via data tagging and lineage)
🔄 Explanation: How These Techniques Translate
CSPM in Data Engg → It's about auditing the cloud resources used by data pipelines: buckets, DBs, secrets managers. You check if they're public, encrypted, etc.
CWPP → Think of it as securing the runtime of your Spark jobs, data containers, and even Lambda functions that move/process data.
CIEM → Who can access which data? Track and prune over-permissioned roles in data platforms (IAM/ACLs/RBAC).
Threat Intel → Feed known malicious indicators into your ingestion filters or alerting systems in pipelines.
Policy-as-code & Compliance → Automate enforcement of your org's data policies, like not pushing PII to logs or test environments.
How It's Applied in Data Engineering
CSPM → Tools like Steampipe, Open Policy Agent (OPA), or Checkov are used to evaluate cloud data environments (e.g., S3, RDS, BigQuery) against compliance/security policies.
CWPP → Tools like Falco and Trivy monitor containers running data workloads (e.g., Spark, Flink, Kafka) for abnormal behavior and image vulnerabilities.
CIEM → Tools and practices like IAM analysis, policy-as-code (OPA), or even Sentry/custom tools monitor and enforce access to data warehouses (Snowflake, Redshift).
Threat Intelligence → Integration of threat feeds (e.g., IP/domain reputation) into data ingestion pipelines to filter malicious data sources or detect suspicious access patterns.
Policy-as-Code → Used to enforce row/column-level data access control, schema evolution policies, data retention, etc., often via Rego, YAML, or native tooling in dbt/Snowflake.
Pipeline Scanning → Check code repositories and Airflow/dbt configs for hardcoded credentials, PII, or insecure data sink configurations using tools like GitLeaks, Checkov, or TruffleHog.
Workload Protection → Use Prometheus/Grafana, Falco, or custom alerting to detect anomalies like excessive reads, failed jobs, or data exfiltration patterns.
Policy Enforcement → Governance rules block schema drift, disallow sensitive data export to untrusted targets, etc., often enforced via CI/CD pipelines or dbt's CI checks.
Compliance Rules → Automated tagging of sensitive data, lineage tracing, and audit logging for data access. Tools: Collibra, DataHub, BigID, custom audits in dbt or cloud-native tools.
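
To illustrate the pipeline-scanning row, here is a toy scanner over Airflow/dbt config files; the regex patterns are illustrative only, and dedicated tools like GitLeaks or TruffleHog should be preferred in practice:

```python
import re
from pathlib import Path

# Illustrative patterns; real scanners ship hundreds of tuned rules.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "hardcoded_secret": re.compile(
        r"(?i)(password|secret|api_key)\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
}

CONFIG_SUFFIXES = {".py", ".yml", ".yaml", ".cfg", ".env"}


def scan(root: str) -> list[tuple[str, str]]:
    """Return (file, pattern-name) hits across pipeline config files."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in CONFIG_SUFFIXES:
            continue
        text = path.read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits.append((str(path), name))
    return hits


if __name__ == "__main__":
    for file, rule in scan("."):
        print(f"{file}: possible {rule}")
```

Wiring a check like this into CI makes it a cheap first gate before the heavier tools run.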

Traditional IT Systems vs. Cloud Services (AWS, Azure, GCP)

Traditional IT Component | AWS | Azure | GCP | Description / Use
Physical Server | EC2 | Virtual Machines (VMs) | Compute Engine | Scalable virtual machines for compute tasks.
Virtualization Platform | EC2 with AMI, Auto Scaling | Azure Virtual Machine Scale Sets | Managed Instance Groups | Auto-scaling and load-balanced VM management.
Server OS & Images | Amazon Machine Images (AMI) | Azure Images | Custom Images / OS Images | Pre-configured OS and software templates.
Load Balancer | Elastic Load Balancer (ELB) | Azure Load Balancer / Application Gateway | Cloud Load Balancing | Distributes incoming traffic across multiple resources.
Storage (SAN, NAS) | EBS, EFS, S3 | Azure Disks, Azure Files, Blob Storage | Persistent Disk, Filestore, Cloud Storage | Block, file, and object storage options.
Networking (Switches, VLANs) | VPC, Subnets, Route Tables | Virtual Network (VNet), Subnets | VPC, Subnets | Virtual networks for securely isolating and connecting resources.
DNS Server | Route 53 | Azure DNS | Cloud DNS | Domain name resolution and routing services.
Firewall / Security | Security Groups, NACLs, WAF | NSGs, Azure Firewall, Azure WAF | Firewall Rules, VPC Firewall | Protect cloud resources with fine-grained traffic rules.
Directory Services (AD) | AWS Directory Service | Azure Active Directory | Cloud Identity, Managed Microsoft AD | Identity and access management services.
Monitoring / Logging | CloudWatch, CloudTrail | Azure Monitor, Log Analytics | Stackdriver (now Cloud Operations) | Observability: metrics, logs, alerts.
Backup & Recovery | AWS Backup, S3 Glacier | Azure Backup, Azure Site Recovery | Backup and DR via Cloud Storage | Data backup and disaster recovery solutions.
Databases (On-Prem SQL, etc.) | RDS, Aurora, DynamoDB | Azure SQL, Cosmos DB | Cloud SQL, Bigtable, Firestore | Managed relational and NoSQL databases.
Data Center / Colocation | AWS Outposts, Local Zones | Azure Stack, Azure Edge Zones | Anthos, GCP Edge | Hybrid cloud solutions bridging on-prem and cloud.
Enterprise Service Bus (ESB) | Amazon EventBridge, SQS, SNS | Azure Service Bus, Event Grid | Pub/Sub, Eventarc | Messaging and event-driven architecture support.
Application Hosting | Elastic Beanstalk, ECS, EKS | App Services, Azure Kubernetes Service | App Engine, GKE | Managed platforms for hosting applications.
CI/CD Tools | CodePipeline, CodeBuild | Azure DevOps, GitHub Actions | Cloud Build, Cloud Deploy | Continuous Integration and Delivery pipelines.
Cloud Services Commonly Used by Data Engineers

Stage | Function | AWS | Azure | GCP
Ingest | Real-time Data Streaming | Kinesis Data Streams / Firehose | Azure Event Hubs / IoT Hub | Pub/Sub / Dataflow (streaming mode)
Ingest | Batch Data Ingestion | AWS Glue, AWS Data Pipeline, S3 Uploads | Azure Data Factory | Cloud Storage, Transfer Service
Ingest | API / Webhook Ingestion | API Gateway + Lambda | Azure API Management + Functions | Cloud Endpoints + Cloud Functions
Store | Object Storage (Raw Data) | Amazon S3 | Azure Blob Storage | Google Cloud Storage
Store | File Storage (Semi-Structured) | Amazon EFS, FSx | Azure Files | Filestore
Store | Relational DB | Amazon RDS, Aurora | Azure SQL DB | Cloud SQL (MySQL/Postgres)
Store | NoSQL / Semi-Structured DB | DynamoDB | Cosmos DB | Firestore / Bigtable
Store | Data Lake | S3 + Lake Formation | Azure Data Lake Gen2 | Cloud Storage + BigLake
Process | ETL/ELT | AWS Glue / EMR | Data Factory / Synapse Pipelines | Dataflow / Dataproc
Process | Batch Processing | AWS EMR (Spark, Hive) | Azure HDInsight / Synapse | Dataproc (Managed Hadoop/Spark)
Process | Real-time Processing | Kinesis Analytics / Lambda | Stream Analytics / Azure Functions | Dataflow (Apache Beam)
Orchestrate | Job Scheduling & Workflow | AWS Step Functions / Managed Workflows | Azure Data Factory Pipelines | Cloud Composer (Airflow)
Orchestrate | Serverless Task Automation | AWS Lambda | Azure Functions | Cloud Functions
Analyze | BI / SQL Engine | Amazon Athena / Redshift | Synapse Analytics / Power BI | BigQuery / Looker
Analyze | ML Integration | SageMaker | Azure ML | Vertex AI
Secure | IAM & RBAC | IAM + Roles + Policies | Azure Active Directory + Role-based Access | Cloud IAM
Secure | Encryption | KMS + S3 Bucket Policies | Azure Key Vault | Cloud KMS + CMEK
Secure | Audit / Monitoring | CloudTrail / CloudWatch | Azure Monitor + Log Analytics | Cloud Audit Logs + Cloud Monitoring
DevOps / Infra | Infra as Code | AWS CloudFormation / Terraform | ARM Templates / Terraform | Deployment Manager / Terraform
DevOps / Infra | Containerized Jobs | ECS / EKS / Fargate | AKS / Container Instances | GKE / Cloud Run

✅ Common Cloud Patterns for Data Engineers:


Modern Data Lakes: S3 + Glue + Athena (AWS), Blob + Synapse + ADF (Azure), Cloud Storage + BigQuery + Dataflow
(GCP)
ETL Pipelines: Glue or EMR (AWS), Data Factory + Spark (Azure), Dataflow + Composer (GCP)
Batch & Streaming Together: Kinesis + Lambda (AWS), Event Hub + Stream Analytics (Azure), Pub/Sub + Dataflow
(GCP)
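
As a sketch of the first pattern (S3 + Glue + Athena on AWS), assuming boto3 is installed with credentials configured, and that the bucket, Glue database, and `trips` table below are hypothetical names:

```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1. Land a raw file in the data lake (bucket name is an assumption).
s3.upload_file("trips.csv", "my-data-lake-bucket", "raw/trips/trips.csv")

# 2. Query it in place with Athena; the table is assumed to be defined
#    in the Glue Data Catalog under the "analytics" database.
response = athena.start_query_execution(
    QueryString="SELECT city, count(*) AS trips FROM trips GROUP BY city",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        "OutputLocation": "s3://my-data-lake-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```

The same shape applies on the other clouds: swap the storage client and the SQL engine, keep the land-then-query pattern.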

Top Competitors to Confluent (Apache Kafka as a Service)

Company / Product | Category | Key Features / Focus Area
Amazon MSK (Managed Streaming for Kafka) | Kafka-compatible Cloud Service | Fully managed Kafka on AWS, deeply integrated with the AWS ecosystem
Azure Event Hubs (Kafka-compatible) | Kafka-compatible Cloud Service | PaaS messaging system with Kafka protocol support
Redpanda | Kafka API-Compatible Engine | Built for performance, lower latency, no JVM, single binary
WarpStream | Kafka API-Compatible + S3 Storage | Kafka-like streaming built over object storage (S3) for cost-efficiency
Aiven Kafka | Managed Kafka Provider | Multi-cloud managed Kafka service
Instaclustr (by NetApp) | Managed Kafka + Open Source Stack | Offers Kafka, Cassandra, Redis as managed services
Cloudera DataFlow (CDF) | Event Streaming + Processing | Supports Kafka, NiFi, Flink, Spark, etc., used in hybrid environments
StreamNative (Pulsar) | Apache Pulsar as Alternative | Multi-tenant, Kafka-like + message queue and stream features
Datastax Astra Streaming | Apache Pulsar + Kafka Bridge | Pulsar-based streaming with managed offering
RabbitMQ | Traditional Message Broker | Pub-sub & queue-based messaging (not streaming, but used similarly)
Apache Flink / Spark Streaming | Stream Processing Engines | Compete with Kafka Streams / ksqlDB for real-time data processing
TIBCO / Software AG | Enterprise Integration & Streaming | Streaming analytics, complex event processing
IBM Event Streams | Kafka-based Enterprise Platform | Secure, enterprise-grade Kafka for IBM Cloud or on-prem
Google Pub/Sub | Native Cloud Pub-Sub System | Kafka alternative with autoscaling and global messaging
Apache Pulsar (OSS) | Open-source Kafka Alternative | Multi-tenant, geo-replication, better queue + pub-sub semantics
NATS.io | Lightweight Messaging System | High-performance pub-sub system, alternative for microservices messaging

Confluent Stands Out With:


ksqlDB (SQL for streams)
Confluent Cloud (fully managed, multi-cloud)
Schema Registry, RBAC, Tiered Storage
Commercial support + enterprise SLAs
Apache Projects Used in Uber

Apache Project (Category) | Purpose at Uber | Where Used in Uber
HDFS (Distributed storage) | Store massive amounts of trip, driver, and rider historical data | Backend data lake
Hive (SQL engine on Hadoop) | Query trip and payment data using SQL | Batch analytics & reporting
Spark (In-memory distributed processing) | Fast data crunching for pricing, ETA predictions | Dynamic pricing, ML models
Flink / Beam (Unified batch & stream processing) | Manage both live trip events and offline data | Real-time fare validation, stream analytics
Kafka (Distributed messaging/event stream) | Event backbone for real-time systems | Trip updates, driver status, fraud detection
Samza (Stream processing framework) | Real-time processing of Kafka events | Rider surge detection, trip status updates
Airflow (Workflow orchestration) | Schedule and manage ETL pipelines | Running analytics pipelines, ML training
NiFi (Data flow orchestration) | Handle the flow of data from one system to another | Ingest logs from apps to storage
Storm (Real-time stream processing) | Legacy system used for real-time data | Alerts, anomaly detection
Druid (Real-time analytics on time-series) | Analyze ride trends, user behavior | Real-time dashboards for ops and product
HBase (NoSQL DB - Hadoop ecosystem) | Store and retrieve massive amounts of trip metadata | Backend systems, fraud detection
Cassandra (Wide-column NoSQL DB) | Handle high-availability user and trip data | Backend trip data, user preferences
Solr / Lucene (Full-text search/indexing) | Search features for support and ops teams | Customer support, analytics dashboards
Zeppelin (Web notebook for analytics) | Interactive notebooks for data exploration | Data scientist tools
Parquet (Columnar data storage format) | Efficient storage of ride and fare data for querying | Data lake for analytics
Avro (Row-based data format / serialization) | Transfer structured data between services | Data pipelines
Oozie (Workflow scheduler for Hadoop jobs) | Schedule batch jobs | Legacy analytics pipelines
Camel (Integration framework - ESB style) | Route messages/data between systems | Backend service communication
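
A minimal PySpark sketch in the spirit of the Spark row above: batch aggregation of trip data into pricing signals. The S3 path and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pricing-signals").getOrCreate()

# Hypothetical Parquet dataset of trips with city and fare columns.
trips = spark.read.parquet("s3://example-bucket/trips/")

# Average fare and trip count per city: the kind of aggregate that feeds
# dynamic-pricing and ETA models.
signals = (
    trips.groupBy("city")
    .agg(F.avg("fare").alias("avg_fare"), F.count("*").alias("trips"))
)
signals.show()
```

The cluster parallelism ("multiple brains at once" in the analogies further down) comes for free from how Spark partitions the input across executors.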
Apache Projects Used in LinkedIn (With LinkedIn Analogies)

Apache Project (Category) | Purpose at LinkedIn | Where Used in LinkedIn
HDFS (Distributed storage) | Store huge volumes of member data, job posts, messages | Data warehouse, offline storage
Hive (SQL engine on Hadoop) | Run SQL queries on stored data for reports and insights | Business intelligence, data science
Spark (In-memory distributed processing) | Fast processing for data science, AI/ML | Recommendation engines, content ranking
Flink / Beam (Unified batch & stream processing) | Process both real-time feed data and offline engagement trends | Feed ranking, notifications
Kafka (Distributed messaging/event stream) | Backbone of real-time event streaming across systems | Activity stream (views, likes), logging, analytics
Samza (Stream processing framework) | Created by LinkedIn for real-time processing of Kafka data | Feed generation, fraud detection
Airflow (Workflow orchestration) | Coordinate scheduled data pipelines | Daily and hourly batch processes
NiFi (Data flow orchestration) | Data ingestion from external and internal sources | Ingest click logs, app telemetry
Storm (Real-time stream processing) | Legacy real-time processing | Alerts, early versions of analytics
Druid (Real-time analytics on time-series) | Powering dashboards and analytics | Real-time insights into member activity
HBase (NoSQL DB - Hadoop ecosystem) | Fast access to large datasets | Messaging backend, profile views
Cassandra (Wide-column NoSQL DB) | Handle large-scale profile data | Distributed profile and messaging storage
Solr / Lucene (Full-text search/indexing) | Search engine for profiles, posts, jobs | Member search, job search, content search
Zeppelin (Web notebook for analytics) | Interactive tool for data scientists | Exploration, hypothesis testing
Parquet (Columnar data storage format) | Efficient data storage for querying | Storing event logs and user behavior
Avro (Row-based data format / serialization) | Serialize and send structured data between services | Kafka messages, data exchange
Oozie (Workflow scheduler for Hadoop jobs) | Schedule batch jobs | Older Hadoop pipeline coordination
Camel (Integration framework - ESB style) | Move data and messages between services | Connecting backend microservices
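
The Parquet rows in both tables hinge on columnar layout. A small pandas sketch (assuming the pyarrow package is installed) shows column pruning, where reading one column avoids scanning the rest of the file:

```python
import pandas as pd

# Toy event log; in production this would be billions of rows.
events = pd.DataFrame({
    "member_id": range(1_000),
    "event": ["profile_view"] * 1_000,
    "ts_epoch": [1_700_000_000 + i for i in range(1_000)],
})
events.to_parquet("events.parquet")

# Column pruning: only the 'event' column is read from disk.
only_events = pd.read_parquet("events.parquet", columns=["event"])
print(only_events["event"].value_counts())
```

This is exactly why the analogies below describe Parquet as stacking "only the important bits" for faster answers.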

🏁 Summary in Layman Terms:


Imagine Uber as a city-wide transportation factory:
Some systems are the garage (storage like HDFS),
Some are dispatch radios (messaging like Kafka),
Others are smart assistants (stream processors like Spark, Flink),
And some are managers and planners (Airflow, Oozie),
While others are filing cabinets and search detectives (HBase, Cassandra, Solr),
And the accountants or analysts use tools like Hive, Zeppelin, and Druid.
Together, these Apache projects help Uber process millions of trips in real-time, detect fraud,
optimize routes, set pricing, and deliver insights to teams across the globe.
Layman Analogy (Uber Style)

HDFS: Like a giant garage where every car (data file) ever used is parked and labeled, even if not used daily.
Hive: Like asking the central Uber HQ: "How many rides were taken in NYC last month?" and getting a clear spreadsheet answer.
Spark: Like Uber using multiple brains at once to instantly calculate how much to charge based on demand and traffic.
Flink / Beam: Like handling both live ride updates and monthly reports using one system, managing both today's rides and historical trends.
Kafka: Like dispatch radios constantly updating the HQ: "Rider onboard," "Trip complete," etc.
Samza: Like an assistant at HQ instantly filtering important messages: "Oh, there's a surge in Manhattan!"
Airflow: Like a manager planning what team runs at what time: "Run fare audit at 2am, then train new pricing model."
NiFi: Like a traffic cop ensuring logs from every phone arrive at the right Uber database garage.
Storm: Like someone shouting urgent updates to dispatch before more modern tools (Spark/Flink) took over.
Druid: Like a live dashboard in Uber HQ showing where demand is spiking every 5 seconds.
HBase: Like a super-fast filing cabinet for ride records Uber needs to pull instantly.
Cassandra: Like each rider/driver having their own notebook with all details, available anytime, even during power cuts.
Solr / Lucene: Like a detective tool to instantly find "all complaints in Chicago about surge pricing."
Zeppelin: Like a shared Uber whiteboard where analysts run tests, write notes, and collaborate.
Parquet: Like storing only what matters, column by column (just fares, just times), for faster answers.
Avro: Like packaging ride details in a compact envelope to send between systems.
Oozie: Like the old school bell system telling teams when to start their task, replaced in many places by Airflow.
Camel: Like a switchboard operator who makes sure the right Uber team gets the right message.

Layman Analogy (LinkedIn Style)

HDFS: Like a huge filing warehouse where every resume, message, and job post is safely archived.
Hive: Like analysts at LinkedIn asking "How many people changed jobs last quarter?" using spreadsheets.
Spark: Like multiple LinkedIn engineers simultaneously calculating the best jobs and connections to show you.
Flink / Beam: Like LinkedIn constantly updating your feed and also learning from past trends to improve it.
Kafka: Like an internal broadcast network updating "User A liked a post" and pushing it everywhere it needs to go.
Samza: Like a robot that watches all activity and instantly decides what to do: show it in the feed, alert someone, log it, etc.
Airflow: Like a task manager who makes sure "every 4 hours, refresh the job recommendation engine."
NiFi: Like a postal system making sure every click is delivered to the right storage shelf.
Storm: Like older systems handling real-time user actions before Samza took over.
Druid: Like a giant, live scoreboard showing "most viewed profiles" or "top trending skills this week."
HBase: Like a turbo-charged filing system that instantly retrieves "Who viewed my profile?"
Cassandra: Like giving every member their own high-speed personal vault for resumes, posts, messages.
Solr / Lucene: Like the brain behind "Search for Python developers in San Jose" or "Find jobs with AI + remote."
Zeppelin: Like a collaborative whiteboard where data scientists brainstorm "Why is engagement down in India?"
Parquet: Like stacking only the important bits of every log neatly to get faster answers.
Avro: Like putting a full resume into a neat envelope so one team can pass it to another.
Oozie: Like a legacy alarm clock still triggering nightly profile cleanup or analytics jobs.
