
Beginner-Friendly, For Data Engineers

Simplified Guide to Modern Data Engineering: Understanding Data Streams, Kafka, Spark, Avro, Parquet, S3 Tables, and AWS Glue Catalog with Analogies and Real-World Examples

Cloud-Agnostic Data Engineering Architecture for Real-Time IoT Pipelines
Scenario

Imagine a fleet of thousands of trucks equipped with IoT sensors streaming millions of data
points per day: GPS coordinates, speed, fuel level, tire pressure, temperature, etc. You
want to:

• Collect, process, and store this data in real time


• Make it queryable for analytics and ML
• Ensure the system is cost-efficient, cloud-portable, and future-proof

Example Payload:

{
"truck_id": "TRUCK123",
"timestamp": "2025-06-25T14:00:01Z",
"latitude": 28.6139,
"longitude": 77.2090,
"speed": 62,
"fuel": 74,
"engine_temp": 88
}

🚛 Sample Truck Sensor Data

| truck_id | timestamp            | latitude | longitude | speed (km/h) | fuel (%) | engine_temp (°C) |
|----------|----------------------|----------|-----------|--------------|----------|------------------|
| TRUCK123 | 2025-06-25T14:00:01Z | 28.6139  | 77.2090   | 62           | 74       | 88               |
| TRUCK456 | 2025-06-25T14:00:02Z | 19.0760  | 72.8777   | 55           | 68       | 92               |
| TRUCK789 | 2025-06-25T14:00:03Z | 13.0827  | 80.2707   | 70           | 59       | 95               |

📘 Column Descriptions

| Column Name | Type     | Description                                                          |
|-------------|----------|----------------------------------------------------------------------|
| truck_id    | String   | Unique identifier of the truck (e.g., license plate or fleet code)    |
| timestamp   | ISO-8601 | The exact time when the sensor reading was recorded (UTC)             |
| latitude    | Float    | Current latitude of the truck (geo-location)                          |
| longitude   | Float    | Current longitude of the truck (geo-location)                         |
| speed       | Integer  | Current speed of the truck in kilometers per hour (km/h)              |
| fuel        | Integer  | Remaining fuel level in percentage (%)                                |
| engine_temp | Integer  | Current engine temperature in degrees Celsius (°C)                    |

🔁 Use Cases

• Predictive maintenance (engine temp, tire pressure)


• Driver behavior analytics (speed, braking)
• Fuel efficiency tracking
• Geofencing & route optimization
• Cargo safety monitoring

We receive millions of IoT data points every day from sensors — and they come in the
form of a stream.

Stream = Like water flowing through a pipe — the data keeps coming in continuously, not
in chunks or files.

Since the data never stops, we can’t treat it like a traditional file upload.
Instead, we capture and process it in motion, then store it efficiently for analytics and
reporting.
📥 What is a Data Stream? (Simplified)

Think of a data stream like water flowing through a pipe — it's continuous, fast, and never
stops.

Now imagine:

Thousands of trucks →
Continuously sending GPS, speed, fuel info →
Like a constant flow of droplets into a pipe →
You need to collect, process, and store this flow in real time.

🛠️ Why Streaming Matters

You can’t wait until the end of the day to collect the data — it’s like trying to:

“Catch rain with a bucket only once a day. By the time you do, most of it has spilled away.”

So instead, we process the data while it's flowing — this is called stream processing.

💡 Analogy: Newsroom vs Daily Newspaper

| Concept         | Streaming (e.g., Kafka)      | Batch (e.g., CSV dumps)        |
|-----------------|------------------------------|--------------------------------|
| Analogy         | Live news updates            | Next-day printed newspaper     |
| When it happens | Instantly, continuously      | Once every few hours or daily  |
| Data volume     | Millions of records per day  | Aggregated data snapshots      |
| Responsiveness  | Real-time decisions possible | Delayed insights               |

🧱 Real-Life Simplified Example

Imagine this:

Every second, a truck sends:

• "I am here (lat/lon)"


• "I am going this fast"
• "I have this much fuel left"

And there are 10,000 trucks, each sending 1 message/second.


That’s 10,000 messages every second
→ 864 million records per day!

You can't write these one by one to files or a database.


So, we use a streaming system like Kafka, which acts like a real-time conveyor belt.

Kafka

Kafka is a publish/subscribe (pub/sub) system — it serves a similar messaging role to Redis Streams or RabbitMQ, but with important differences in scale, durability, and purpose.

Let’s break it down with a simple analogy and comparison:

🧠 Analogy

Kafka, Redis Streams, and RabbitMQ are different types of message delivery services.

| System        | Analogy                                                                                                      |
|---------------|--------------------------------------------------------------------------------------------------------------|
| Kafka         | A cargo train with multiple containers that keeps a history of all packages and lets consumers re-read past deliveries. |
| RabbitMQ      | A post office — delivers messages directly to recipients once, usually in real time.                          |
| Redis Streams | A delivery van with limited storage — fast, but best for short-term message retention.                        |

🧾 Situation
You're receiving raw IoT data in Kafka — possibly millions of messages per day. Now you
want to store that data in S3.

But before storing, you often need a transformation step.


Why? Because raw data is rarely clean or efficient for long-term storage or analytics.
🧠 Common Transformation Scenarios (with Examples)
| Transformation Type            | Description                                                  | Example Use Case                                         |
|--------------------------------|--------------------------------------------------------------|----------------------------------------------------------|
| Timestamp Normalization        | Convert formats, align to timezone, or truncate granularity  | "2025/06/25 14:05:01" → "2025-06-25T14:05:00Z" (minute-level) |
| Unit Conversion                | Convert raw units to standard ones                           | Speed in mph → kph, Fahrenheit → Celsius                 |
| Deduplication                  | Remove repeated records                                      | Sensor sent the same reading twice due to a retry        |
| Null Handling                  | Drop or impute missing fields                                | If fuel = null, set to previous known value or "N/A"     |
| Geo-Tagging / Mapping          | Map lat/lon to city, region, or zone                         | 28.6, 77.2 → "Delhi NCR"                                 |
| Schema Flattening              | Flatten nested structures for easier storage                 | {sensor: {temp: 85}} → sensor_temp: 85                   |
| Error Correction               | Fix incorrect but common values                              | "null" string → null, "speed": -1 → dropped              |
| Sessionization                 | Add session boundaries or activity blocks                    | Group sensor data per trip or per engine cycle           |
| Anomaly Tagging                | Add flags for outlier values using rules or models           | engine_temp > 120 → "status": "anomaly"                  |
| Enrichment with Lookup         | Join with external/static data sources                       | Truck ID → add truck type, route, depot from DB          |
| Tokenization / Masking         | Obfuscate sensitive info before storing                      | Replace driver ID with hashed ID or token                |
| Deriving New Columns           | Create new metrics or fields                                 | "speed" > 80 → is_overspeeding = true                    |
| Event Deduction                | Infer higher-order events from raw data                      | Sequence of 0 speed → engine_off_event = true            |
| Field Renaming / Standardizing | Rename inconsistent or cryptic field names                   | lat → latitude, ts → timestamp                           |
| Time-window Aggregation        | Group data into intervals (e.g., per minute/hour)            | Avg speed per truck per minute                           |

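For illustration, here is a minimal PySpark sketch applying a few of these transformations (deduplication, null handling, a derived column, and anomaly tagging). The input path and the sentinel value are placeholders; column names follow the example payload.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-transforms").getOrCreate()

# Hypothetical raw sensor data landed as JSON (path is a placeholder)
raw = spark.read.json("s3://your-bucket/raw/truck-iot/")

cleaned = (
    raw
    # Deduplication: keep one record per truck per timestamp
    .dropDuplicates(["truck_id", "timestamp"])
    # Null handling: impute missing fuel readings with a sentinel value
    # (the guide also suggests carrying forward the last known value)
    .fillna({"fuel": -1})
    # Deriving a new column
    .withColumn("is_overspeeding", F.col("speed") > 80)
    # Anomaly tagging with a simple rule
    .withColumn("status", F.when(F.col("engine_temp") > 120, "anomaly").otherwise("ok"))
)
```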
🔁 How Do We Transform Kafka Data?

✅ Using a Spark Job

Yes — a Spark job is a very common and powerful way to transform Kafka data in real-
time or in micro-batches.

• You write code (in Python, Java, or Scala) to:


o Connect to Kafka (as a subscriber)
o Read messages
o Apply transformations (cleaning, enrichments, etc.)
o Write the output to S3 (Parquet, Iceberg, etc.)
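Below is a minimal PySpark Structured Streaming sketch of that flow, assuming JSON-encoded messages on a hypothetical truck-telemetry topic; the broker address, topic, S3 paths, and trigger interval are placeholders.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, IntegerType

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

# Schema of the incoming sensor payload (matches the example above)
schema = (StructType()
          .add("truck_id", StringType())
          .add("timestamp", StringType())
          .add("latitude", DoubleType())
          .add("longitude", DoubleType())
          .add("speed", IntegerType())
          .add("fuel", IntegerType())
          .add("engine_temp", IntegerType()))

# 1. Connect to Kafka (as a subscriber)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
       .option("subscribe", "truck-telemetry")             # hypothetical topic
       .load())

# 2. Read messages and apply transformations
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("r"))
          .select("r.*")
          .withColumn("event_time", F.to_timestamp("timestamp"))
          .withColumn("is_overspeeding", F.col("speed") > 80))

# 3. Write the output to S3 as Parquet in micro-batches
query = (parsed.writeStream.format("parquet")
         .option("path", "s3://your-bucket/data/truck-iot/")
         .option("checkpointLocation", "s3://your-bucket/checkpoints/truck-iot/")
         .trigger(processingTime="1 minute")
         .start())
```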

🏭 Real-Life Analogy: Data as Packages

Imagine you run a logistics hub:

• Kafka is the conveyor belt bringing in boxes from trucks non-stop.


• Each box contains raw data (could be messy, mislabelled, duplicated).
• Spark is your processing factory:
o It opens each box
o Inspects the contents
o Fixes issues (e.g., renames fields, removes duplicates)
o Adds labels (e.g., tags for location or time)
o Packs them neatly
o Sends them to storage (S3) in clean, compressed format (e.g., Parquet).

🔧 Spark Roles in the Pipeline

| Role                | Factory Analogy                           | Spark Equivalent                              |
|---------------------|-------------------------------------------|-----------------------------------------------|
| Raw material intake | Boxes arrive on a belt                    | Kafka topics stream data into Spark           |
| Quality check       | Check for damage, missing parts           | Data cleaning (nulls, malformed records)      |
| Relabelling         | Fix wrong labels or standardize packaging | Rename fields, normalize units or formats     |
| Assembly            | Combine with parts from storage           | Join/enrich with metadata or reference datasets |
| Sorting & routing   | Route by destination (e.g., warehouse zone) | Partitioning by time, region, etc.          |
| Final packing       | Compress and seal the final box           | Write as Parquet with schema and compression  |
| Shipping out        | Send to final storage                     | Write to S3, Iceberg table, or data warehouse |

🕹️ Modes Spark Works In

| Mode           | Analogy                                | Description                             |
|----------------|----------------------------------------|-----------------------------------------|
| Batch Mode     | Process one truckload at a time        | Historical data processing              |
| Streaming Mode | Continuous assembly line, never stops  | Real-time or near-real-time data processing |

📌 Key Takeaway

Apache Spark is like a smart, scalable factory that:

• Listens to data from Kafka (like an assembly line)


• Applies cleaning, transformation, enrichment
• Writes optimized output to your cloud storage (like a warehouse)

🧰 Are There No-Code / Low-Code Alternatives in AWS?

Yes, here are some options if you want to avoid writing Spark code:

1. AWS Glue Studio

• Drag-and-drop visual interface


• Can connect to Kafka, apply basic transformations, and write to S3
• Great for ETL, but not full real-time
• Backed by Spark under the hood
2. AWS Kinesis Data Firehose (with Lambda or Transformations)

• Stream from source (like Kafka or IoT Core)


• Apply lightweight transformation via AWS Lambda
• Auto-write to S3 in batches
• Limited but serverless and easy
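As a sketch of what such a lightweight transformation might look like, here is a Lambda handler following the Firehose record-transformation contract; the field logic is illustrative, not a prescribed implementation.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose record transformation: parse each record, add a derived field,
    and mark unparsable records as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Illustrative transformation: tag overspeeding trucks
            payload["is_overspeeding"] = payload.get("speed", 0) > 80
            data_out = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": data_out})
        except (ValueError, KeyError):
            # Leave the original data untouched and flag the record
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```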

Will storing a month's worth of data, with millions of records each day, require the same storage size in S3 regardless of whether it's in JSON, CSV, or Parquet format?

No. Storing millions of IoT records per day in JSON, CSV, and Parquet will not take the same amount of space in S3. Here's how they compare:

📦 Example: 1 Day of IoT Data (10 million Records)

| Format  | Estimated Size (Uncompressed) | With Compression                  |
|---------|-------------------------------|-----------------------------------|
| JSON    | ~6–8 GB                       | ~3–4 GB (Gzip)                    |
| CSV     | ~4–5 GB                       | ~1–2 GB (Gzip)                    |
| Parquet | ~1–1.5 GB                     | Already compressed (Snappy/Zstd)  |

1 Month = Multiply by 30:

• JSON: ~180–240 GB (Gzipped JSON: 90–120 GB)


• CSV: ~120–150 GB (Gzipped CSV: 30–60 GB)
• Parquet: ~30–45 GB

✅ Why Parquet is Recommended for IoT Analytics:

• Columnar storage: Only read what you need (e.g., speed, GPS, not tire pressure).
• Built-in compression: Uses Snappy/Zstandard efficiently.
• Schema evolution: Compatible with Glue/Athena.
• Partitioning: Store by date, truck_id, etc., for faster queries.
🚀 Optimization Tips for S3 Storage

• Use Apache Parquet + Snappy/Zstd compression.


• Partition by year/month/day or truck_id.
• Use AWS Glue Catalog + Athena for queryable data lake.
• Use S3 lifecycle policies to move older data to S3 Glacier if needed.

📦 S3 Storage Structure

s3://your-bucket/data/truck-iot/
└── year=2025/
└── month=06/
└── day=25/
├── part-0000.parquet
├── ...
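A Spark writer can produce this layout by partitioning on date-derived columns; a minimal sketch, where the events DataFrame, bucket name, and column names are placeholders:

```python
from pyspark.sql import functions as F

# `events` is assumed to be a DataFrame with an event_time timestamp column
partitioned = (events
               .withColumn("year",  F.date_format("event_time", "yyyy"))
               .withColumn("month", F.date_format("event_time", "MM"))
               .withColumn("day",   F.date_format("event_time", "dd")))

# Writes files under year=YYYY/month=MM/day=DD/ as shown above
(partitioned.write
 .mode("append")
 .partitionBy("year", "month", "day")
 .option("compression", "snappy")
 .parquet("s3://your-bucket/data/truck-iot/"))
```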

Storing a 5 MB JSON file in Parquet format in S3 will not take the same amount of space — in fact, Parquet will almost always take significantly less.

📊 JSON vs Parquet Storage Efficiency


| Format  | Original Size         | On S3 (Approx.)                 | Compression                    | Structure            |
|---------|-----------------------|---------------------------------|--------------------------------|----------------------|
| JSON    | 5 MB                  | ~5 MB (raw) / ~2–3 MB (Gzipped) | Line-by-line compression only  | Text, redundant keys |
| Parquet | 5 MB JSON equivalent  | ~500 KB – 1.5 MB                | Columnar + Snappy/Zstd         | Binary + columnar    |

🧪 Why Parquet is Smaller


1. Columnar format: Groups values of each field together → great for compression.
2. Efficient encoding: Uses binary encoding, dictionaries, and run-length encoding.
3. Built-in compression: Uses fast algorithms like Snappy or Zstandard.
4. No repeated keys: JSON stores key names in every row, Parquet stores them once
in metadata.

🔁 Example (Real-World Estimate)


| Data Format      | Description                     | S3 Size      |
|------------------|---------------------------------|--------------|
| Raw JSON         | 5 MB sensor JSON                | ~5 MB        |
| Gzipped JSON     | Compressed with gzip            | ~2.5–3 MB    |
| Parquet (Snappy) | Same data, converted to Parquet | ~600–900 KB  |

Parquet typically gives 3–8x size reduction vs raw JSON.

S3 is inexpensive, so why should we be concerned about space?

It's true that S3 is relatively cheap, but storage efficiency still matters, especially at scale and when combined with other hidden costs.

Here’s why space matters even when S3 seems cheap:

💰 1. S3 Cost Adds Up at Scale


| Daily Ingestion | JSON Size (Uncompressed) | Parquet Size (Compressed) | Monthly Cost Difference |
|-----------------|--------------------------|---------------------------|-------------------------|
| 5 GB/day        | ~150 GB/month            | ~30–50 GB/month           | Up to 5× lower cost     |

Example:

• S3 Standard: $0.023/GB/month
• 150 GB JSON = ~$3.45/month
• 30 GB Parquet = ~$0.69/month
• Multiply this across 100 datasets → big impact
🚀 2. Athena/Glue Query Costs Are Based on Data Scanned
• Athena charges $5 per TB scanned
• Querying JSON scans entire file, including keys
• Parquet enables column pruning: only reads what's needed
• This means 80–95% reduction in query cost

Example:

• Query 1 TB JSON = $5
• Same query on Parquet = ~$0.25–$0.5
• Query cost > storage cost in most analytics workloads

📉 3. Performance & Speed


• Parquet files load 5–50x faster than JSON in tools like:
o Spark, Glue, Athena, Redshift Spectrum, Trino
• Less data to scan → faster results
• Faster queries = cheaper EMR/Glue job run times = money saved

✅ Summary: Why Space Still Matters


| Factor                  | JSON (raw)    | Parquet (optimized)          |
|-------------------------|---------------|------------------------------|
| Storage Cost            | Low initially | Much lower at scale          |
| Query Cost              | High          | Low (columnar, compressed)   |
| Performance             | Slower        | Much faster                  |
| Long-Term Use           | Inefficient   | Preferred industry-wide      |
| Total Cost of Ownership | Higher        | Lower                        |

📌 Bottom Line:

S3 is cheap. But querying, transferring, and scaling unoptimized data is not.


So, even though S3 itself is inexpensive, optimizing your data (e.g., Parquet over JSON)
saves you money, improves performance, and ensures your data lake scales intelligently.

📦 What is Avro?
Avro is a row-based, binary, compact data format developed by Apache for:

• Efficient serialization (like zipping data)


• Schema-based storage
• Data exchange across systems

Think of Avro as a “structured zip file” for data rows that’s fast to serialize/deserialize and
works well in streaming or inter-system communication.

🧪 Sample Avro Use Case

🔄 Scenario: Sensor Data Ingestion (Kafka → S3)

• Your IoT sensors send data like this:

{
"truck_id": "TRUCK123",
"timestamp": "2025-06-26T14:05:00Z",
"speed": 62,
"fuel_level": 73
}

📦 Store in Avro (instead of JSON or Parquet)

• You define a schema (once), e.g.:


o truck_id: string
o timestamp: timestamp
o speed: int
o fuel_level: int
• Then every data record just stores values efficiently in binary, like:
o "TRUCK123", 2025-06-26T14:05:00Z, 62, 73
This keeps files compact, self-described, and very fast to read/write.
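As an illustration, the same idea with the fastavro library: the schema is defined once and each record is written as compact binary values. Field names follow the example above; the timestamp is kept as a string for simplicity, and the output path is a placeholder.

```python
from fastavro import writer, parse_schema

# Define the schema once
schema = parse_schema({
    "type": "record",
    "name": "TruckData",
    "fields": [
        {"name": "truck_id", "type": "string"},
        {"name": "timestamp", "type": "string"},
        {"name": "speed", "type": "int"},
        {"name": "fuel_level", "type": "int"},
    ],
})

records = [
    {"truck_id": "TRUCK123", "timestamp": "2025-06-26T14:05:00Z",
     "speed": 62, "fuel_level": 73},
]

# Each record is stored as compact binary values;
# the schema travels once in the file header, not in every row
with open("truck_data.avro", "wb") as out:
    writer(out, schema, records)
```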

✅ When to Use Avro (vs Parquet/JSON)


| Use Case                                     | Use Avro?   | Why                              |
|----------------------------------------------|-------------|----------------------------------|
| Kafka / real-time pipelines                  | Best fit    | Works natively with Kafka        |
| Streaming data into S3                       | Yes         | Good for row-wise ingestion      |
| Spark ETL intermediate format                | Yes         | Efficient with schema evolution  |
| SQL analytics on big datasets                | Use Parquet | Parquet is better for queries    |
| Interoperable data exchange between systems  | Yes         | Language-independent format      |
| Long-term archival + analytics               | Use Parquet | Better compression/querying      |

🆚 Avro vs Parquet vs JSON (Simple Table)


| Feature           | JSON | Avro         | Parquet         |
|-------------------|------|--------------|-----------------|
| Format Type       | Text | Binary (row) | Binary (column) |
| Human Readable    | Yes  | No           | No              |
| Compression       | Poor | Good         | Excellent       |
| Query Performance | Poor | OK           | Excellent       |
| Streaming         | OK   | Best fit     | Poor            |
| Schema Evolution  | None | Great        | Limited         |

🧠 Summary
• Use Avro for streaming, transport, schema evolution.
• Use Parquet for analytics, SQL queries, storage efficiency.
• Use JSON for debugging, testing, or APIs (not for large-scale storage).
Why is Avro better for transport and streaming than
Parquet? — with simple explanations and a real-world
example.

🚚 Why Avro is Better for Transport/Streaming Than Parquet

| Feature                 | Avro                               | Parquet                              |
|-------------------------|------------------------------------|--------------------------------------|
| Format Type             | Row-based (record by record)       | Column-based (optimized for batch)   |
| Streaming Friendly      | Yes – write/read record by record  | No – needs a full row group for write/read |
| Serialization Speed     | Fast (small chunks, real-time)     | Slower (bulk write needed)           |
| Kafka/Spark Integration | Native support in Kafka, Spark     | Not designed for streaming transport |
| Size & Compactness      | Efficient, with schema embedded    | Heavier, designed for large reads    |
| Schema Evolution        | Excellent (handles new/old fields) | Limited (needs consistent schema)    |

🔄 Simplified Example

🛠 Scenario:

Your IoT system sends 1,000 sensor readings per second from truck devices to a
backend.

🔄 Option A: Avro over Kafka

• Each record is encoded in Avro and published to Kafka instantly.


• Kafka consumers (e.g. Spark) read one record at a time and process it.
• Very low-latency, ideal for real-time streaming.

Use Avro → works record-by-record, schema embedded, no buffering required.


🚫 Option B: Parquet over Kafka

• Parquet expects a large block of records (called a row group).


• To send 1,000 small records, you must buffer them in memory first.
• Delays streaming → introduces latency and complexity.
• Not suitable for direct stream transport.

Parquet is designed for batch processing and analytics, not real-time transport.

✅ Summary: When to Use Avro over Parquet


| Situation                        | Use Avro? | Use Parquet?          |
|----------------------------------|-----------|-----------------------|
| Real-time ingestion (Kafka)      | Yes       | No                    |
| Event-driven microservices       | Yes       | No                    |
| Stream processing (Spark, Flink) | Yes       | Maybe, for sinks only |
| Long-term storage / SQL queries  | No        | Yes                   |

🧠 Final Thought
Avro = efficient wire format for sending data
Parquet = efficient disk format for querying data

The sections that follow build toward exactly such a pipeline: Kafka → Avro → Spark → Parquet in S3, with Glue Catalog integration.

📦 Why Parquet Requires Buffering Before Writing


Parquet is a columnar storage format, not a row-by-row writer like JSON or Avro. Here’s
why buffering is required:

🧱 1. Column-Oriented Format Needs Row Groups

• Parquet organizes data in row groups, each containing columns stored separately.
• To create a row group, it must reorganize records column-wise:
Input rows:
Row 1: [a1, b1, c1]
Row 2: [a2, b2, c2]

Parquet needs to buffer:


Column A: [a1, a2]
Column B: [b1, b2]
Column C: [c1, c2]

• This requires gathering a batch of rows in memory first.

📏 2. Minimum Row Group Size (Performance & Compression)

• A typical row group size is 128 MB or more.


• If you write one row at a time, you:
o Waste compression
o Create tiny files (bad for performance)
o Break schema consistency

So, to write efficiently, Parquet libraries buffer many records (e.g., 10,000 rows)
before writing.
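For example, with the pyarrow library the writer accumulates a batch of rows in memory and flushes it as one row group; a minimal sketch where the batch size, schema, and the incoming_records() generator are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("truck_id", pa.string()),
    ("speed", pa.int32()),
    ("engine_temp", pa.int32()),
])

writer = pq.ParquetWriter("sensor_batch.parquet", schema, compression="snappy")

buffer = []                           # rows accumulate here before becoming a row group
for record in incoming_records():     # hypothetical generator of sensor dicts
    buffer.append(record)
    if len(buffer) >= 10_000:         # flush one row group at a time
        writer.write_table(pa.Table.from_pylist(buffer, schema=schema))
        buffer.clear()

if buffer:                            # flush any remaining rows
    writer.write_table(pa.Table.from_pylist(buffer, schema=schema))
writer.close()
```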

🔄 3. Compare to Avro (Row-Based)

• Avro stores records one after another, like a stream:

Row 1 → Row 2 → Row 3 → ...

• No need to shuffle fields into columns.


• Can write directly as records arrive (ideal for streaming).

🧪 Real Example

Parquet

• Your Spark job receives 1 record/sec.


• It waits to collect ~10,000 records (~128 MB) before writing a single Parquet file.
• This introduces latency and requires in-memory buffering.

Avro

• Your Spark job receives 1 record/sec.


• Each record is serialized and written immediately — no waiting, no batching.

🚫 Why Not Write Parquet Row-by-Row?


• Parquet is not designed for that — doing so:
o Destroys compression benefits
o Produces too many small files
o Increases read cost and query latency
o Causes poor performance in Athena/Glue

How Much Time Does Avro → Spark → Parquet Take?

⚙️ Factors Affecting Transformation Time

| Stage                                 | Depends On                                      |
|---------------------------------------|-------------------------------------------------|
| 1. Reading from Kafka (Avro)          | Record size, Avro schema complexity, batch size |
| 2. Deserialization to Spark DataFrame | Avro library + schema inference time            |
| 3. DataFrame transformations          | Business logic, UDFs, repartitioning            |
| 4. Writing to Parquet (S3)            | Partitioning, compression codec, file size      |
| 5. Network/S3 write latency           | Region, file sizes, concurrency                 |

🔍 Typical Benchmarks (Real-World Estimates)


| Records per batch | Avro Payload Size | Spark Transformation Time (ETL only) | Write to Parquet Time |
|-------------------|-------------------|--------------------------------------|-----------------------|
| 100,000           | 2 KB each         | ~1–3 seconds                         | ~3–5 seconds          |
| 500,000           | 2 KB each         | ~5–8 seconds                         | ~8–12 seconds         |
| 1,000,000         | 2 KB each         | ~10–15 seconds                       | ~15–25 seconds        |
Numbers assume:
• 4–8 vCPU Spark executors
• Data partitioned by date/truck_id
• Snappy compression
• S3 bucket in same region as Spark job (minimal network overhead)

🧪 Breakdown of Each Stage

1. Kafka → Spark Avro Deserialization

• Very fast with Confluent AvroDeserializer


• Takes ~1–2 seconds per 100k records

2. DataFrame Transformations

• Depends on complexity:
o Simple column selection: negligible (<1s)
o Enriching with static datasets: 1–3s
o Heavy joins/UDFs: 5–10s+

3. Write to Parquet in S3

• File size matters:


o Target ~128MB per file
o Snappy compression helps
• Parallelism (number of partitions/executors) matters
Why do we need a schema registry when sending Avro
formatted data to Kafka? How will it help?

🧠 What is a Schema Registry?


A Schema Registry is a centralized service that:

• Stores Avro (or Protobuf/JSON) schemas


• Assigns them unique IDs (schema versions)
• Helps producers and consumers agree on what data means
• Enables safe evolution of data formats

Think of it like a contract manager for your Kafka messages.

❓ Why Do You Need It with Avro?

🔧 Problem Without Schema Registry:

• Avro requires a schema to read data.


• If you embed full schema in every message, it's:
o Big (adds bloat to every message)
o Hard to manage if schemas change over time
• Consumers must somehow guess or infer schema — very fragile!

✅ Schema Registry Solves This:

1. Producers send only schema ID, not full schema


a. Makes Avro messages tiny and fast
2. Consumers fetch schema by ID when reading
a. No need to hardcode schema in code
3. Enables safe evolution (e.g., add a field, change type)
a. Schema Registry validates compatibility
4. Versioning is built-in
a. You can evolve schemas without breaking consumers
🔄 Message Flow with Schema Registry

🚚 When Producer Sends:

• Serializes message using Avro


• Adds schema ID (e.g., 0x10) as a prefix
• Sends to Kafka topic

🧾 Message on Kafka looks like:

[MagicByte][SchemaID][AvroBinaryData]

🧑‍💻 When Consumer Reads:

• Reads schema ID from message


• Fetches schema definition from Schema Registry
• Deserializes Avro message properly
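A minimal producer sketch using the confluent-kafka Python client, assuming a Schema Registry at a placeholder URL; the topic name and broker address are also placeholders.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

schema_str = """
{
  "type": "record",
  "name": "TruckData",
  "fields": [
    {"name": "truck_id", "type": "string"},
    {"name": "speed", "type": "int"}
  ]
}
"""

# The serializer registers/looks up the schema in the registry;
# only the schema ID travels with each message
registry = SchemaRegistryClient({"url": "http://localhost:8081"})   # placeholder URL
avro_serializer = AvroSerializer(registry, schema_str)

producer = SerializingProducer({
    "bootstrap.servers": "broker:9092",          # placeholder
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": avro_serializer,
})

producer.produce(topic="truck-telemetry",        # hypothetical topic
                 key="TRUCK123",
                 value={"truck_id": "TRUCK123", "speed": 62})
producer.flush()
```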

📦 Example

Original Schema:

{
"type": "record",
"name": "TruckData",
"fields": [
{"name": "truck_id", "type": "string"},
{"name": "speed", "type": "int"}
]
}
Later, you add:

{"name": "fuel_level", "type": "int", "default": 100}

• Schema Registry stores both versions


• Producer starts using v2
• Consumers using v1 still work because of default

This is called backward compatibility — Schema Registry ensures it.

✅ Benefits of Schema Registry with Avro + Kafka


| Benefit                    | Why It Matters                        |
|----------------------------|---------------------------------------|
| Smaller Kafka messages     | Schema not repeated per message       |
| Central schema management  | Easy to track versions, owners        |
| Compatibility checks       | Prevent breaking changes              |
| Safe evolution             | Add/remove fields safely              |
| Language interoperability  | Java → Python → Go → all speak Avro   |

✨ What Is the Magic Byte?


The Magic Byte is a single byte (1 byte) at the start of every Avro message sent to Kafka
when using a Schema Registry.

🧱 Purpose:

It tells the Kafka consumer:

“This message was serialized using Schema Registry format.”

🔍 Message Structure with Avro + Schema Registry


When you send a message to Kafka with Avro (and Confluent Schema Registry), the
message format is:
[ Magic Byte ][ Schema ID (4 bytes) ][ Avro Serialized Data ]

| Byte(s)         | Purpose                                                          |
|-----------------|------------------------------------------------------------------|
| 0x00 (1 byte)   | Magic Byte — indicates format                                    |
| 4 bytes         | Schema ID — tells the consumer which schema to fetch from the registry |
| Remaining bytes | Avro binary data                                                 |

🔢 Why It's Called “Magic”?


It’s a common computer science pattern:

• A magic number (or byte) identifies the format of a file or stream.


• Example: PDF files begin with %PDF, PNG files start with \x89PNG, etc.
• In Kafka, 0x00 is the magic byte for Confluent Avro format.

🧑‍💻 How It Helps the Consumer


1. Sees 0x00 → knows this is Avro with Schema Registry.
2. Reads next 4 bytes → finds schema ID (e.g., 0x0000001f = ID 31).
3. Fetches schema from Schema Registry.
4. Deserializes the rest of the message.

Without the magic byte, the consumer wouldn’t know how to interpret the message — it
might crash or give incorrect results.
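A small sketch of how this framing could be inspected by hand (the Confluent deserializers do this for you; the payload bytes below are illustrative):

```python
import struct

def parse_confluent_frame(raw: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, avro_payload)."""
    if raw[0] != 0x00:
        raise ValueError("Not a Confluent Schema Registry framed message")
    # Bytes 1-4: big-endian schema ID
    (schema_id,) = struct.unpack(">I", raw[1:5])
    return schema_id, raw[5:]

# Example: magic byte 0x00, schema ID 31, then the Avro binary data
frame = b"\x00\x00\x00\x00\x1f" + b"...avro bytes..."
schema_id, payload = parse_confluent_frame(frame)
print(schema_id)   # 31
```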

📦 Example: Raw Bytes in a Kafka Message


Imagine this byte stream:

[0x00][0x00][0x00][0x00][0x1f][Avro data bytes...]

• 0x00 → magic byte


• 0x0000001f → schema ID 31
• [Avro data bytes...] → compressed binary representation of your sensor data

🔷 Magic Byte = Fixed Identifier (Like a Port Number)


| Concept           | Meaning                                                                                         |
|-------------------|-------------------------------------------------------------------------------------------------|
| Magic Byte (0x00) | A fixed, unique identifier that tells the consumer: "This message uses Avro + Schema Registry format." |
| Why it's fixed    | It's always 0x00 for Avro messages using Confluent Schema Registry                              |
| Analogy           | Like a fixed port number that routers recognize for specific protocols (e.g., Port 80 = HTTP)   |

🎯 How It Works in Practice

• Producer sends Avro message → prepends 0x00 + schema ID


• Kafka stores the full byte stream
• Consumer reads the message:
o Sees 0x00 → “Ah! This is a Confluent Avro message.”
o Reads next 4 bytes → schema ID (e.g., 31)
o Uses that ID to fetch schema from Schema Registry
o Deserializes the binary data safely

📦 Format Recap
| Byte Offset | Size     | Field        | Value Example     |
|-------------|----------|--------------|-------------------|
| 0           | 1 byte   | Magic Byte   | 0x00              |
| 1–4         | 4 bytes  | Schema ID    | 0x0000001f (31)   |
| 5+          | variable | Avro Payload | Serialized binary |
🧠 Why It Matters
• Prevents misreading or deserialization failures
• Supports multiple message types in Kafka (JSON, Protobuf, etc.)
• Helps polyglot systems (Java, Python, Scala) understand what they’re reading

Should we continue storing sensor data in S3 tables or Iceberg, or is it no longer necessary?

Yes, you absolutely should still store sensor data in S3 tables — and using Apache
Iceberg (or Hudi/Delta) makes it even better.

Here’s a clear explanation of why it’s needed, and when just Kafka is not enough.

🧪 Why Not Just Kafka?


Kafka is excellent for real-time data transport, but:

| Limitation          | Kafka (Avro)                                        |
|---------------------|-----------------------------------------------------|
| Data retention      | Short (default 7 days); not for long-term storage   |
| Historical analysis | No — hard to scan years of data from Kafka          |
| Cost                | Costly to keep high-volume topics forever           |
| Query with SQL      | No built-in support — requires complex tooling      |
| Schema evolution    | Handled in messages, but no time travel or deletes  |

So, Kafka is not your data warehouse or data lake — it’s a pipe, not a lake.

🪣 Why Store Sensor Data in S3 Tables (with Iceberg)?


| Feature                         | Benefit for Sensor Data                          |
|---------------------------------|--------------------------------------------------|
| Long-term storage               | Archive months/years of data cheaply             |
| Columnar format (Parquet)       | Save 70–90% space vs raw JSON                    |
| Partitioning (by day/device)    | Fast queries, cheap scans                        |
| Schema evolution (with Iceberg) | Add fields safely over time                      |
| Time travel & versioning        | View older snapshots (e.g., "last week's data")  |
| SQL queries (Athena, Trino)     | Query directly from S3 using SQL                 |
| Integration with BI/ML tools    | Power dashboards, alerts, models                 |

📦 Iceberg/Hudi/Delta Lake Adds Superpowers


| Parquet Alone            | Iceberg/Hudi/Delta                  |
|--------------------------|-------------------------------------|
| Static files             | Table abstraction                   |
| Manual partition handling| Automatic partition evolution       |
| No upserts/deletes       | Supports upserts, deletes, merges   |
| No versioning            | Snapshots + time travel             |
| Athena/Glue works        | Works better, with scalable metadata|

So: you turn raw files into queryable, governed, versioned tables — like a lakehouse.

IoT sensor data is stored in S3 in Parquet format. How can S3 tables help in this context?

📂 What You Have Now


| Component | Purpose                                          |
|-----------|--------------------------------------------------|
| Parquet   | Efficient columnar file format                   |
| S3        | Scalable object storage                          |
| So far    | Sensor data is stored cheaply and efficiently in S3 |

But without S3 Tables (e.g., Iceberg), you're missing all the data lakehouse capabilities.

🧱 What Are “S3 Tables” (Like Iceberg)?


S3 tables are table abstractions on top of your raw Parquet files.
They organize, track, and manage those files as logical tables, just like a database.

You gain all the benefits of SQL tables — but on cheap object storage like S3.

✅ Why You Need S3 Tables for IoT Data (Even After Storing in Parquet)

| Problem Without S3 Tables          | How S3 Tables Help (e.g., Iceberg)                           |
|------------------------------------|--------------------------------------------------------------|
| No easy way to query years of data | SQL queries over years, filtered by time/device              |
| Manual partition handling          | Automatic partition evolution & pruning                      |
| Can't handle schema changes        | Schema evolution supported (add/remove fields)               |
| Hard to do updates/deletes         | Supports upserts & deletes (e.g., for GDPR, corrections)     |
| Can't time travel                  | Query data "as of" a past snapshot (e.g., last week)         |
| No ACID guarantees                 | Reliable inserts, merges, and transactional writes           |
| Glue/Athena sees raw files         | Registers table metadata → makes files queryable as SQL tables |

🧪 IoT Use Case Example

Without S3 Tables:

• You dump daily Parquet files into:

s3://iot-data/sensor_data/2025/06/26/...
• You query via Athena, but:
o You must manage partitions manually
o No version control or history
o No clean way to update records (e.g., sensor correction)

With Iceberg Table on S3:

Now your table is defined like:

iot_sensor_data (
truck_id STRING,
timestamp TIMESTAMP,
speed DOUBLE,
fuel_level INT,
location STRUCT<lat: DOUBLE, lon: DOUBLE>
)

And now you can:

• Do partition pruning (e.g., query only yesterday)


• Run:

SELECT * FROM iot_sensor_data
WHERE truck_id = 'TRUCK-123'
  AND timestamp > now() - interval '1 day'

• Run time-travel:

SELECT * FROM iot_sensor_data
FOR TIMESTAMP AS OF '2025-06-25 08:00:00'

• Run incremental processing:

SELECT * FROM iot_sensor_data
WHERE __change_type = 'insert' AND __ts > last_processed_time
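Expressed through Spark read options rather than SQL, time travel and incremental reads might look like this; a minimal sketch assuming the Iceberg Spark runtime is configured, with placeholder catalog/table names, timestamp, and snapshot IDs.

```python
# Time travel: read the table as it existed at a point in time (epoch milliseconds)
df_asof = (spark.read.format("iceberg")
           .option("as-of-timestamp", "1750838400000")     # placeholder timestamp
           .load("my_catalog.iot_db.iot_sensor_data"))     # placeholder table name

# Incremental read: only the data appended between two snapshots
df_incr = (spark.read.format("iceberg")
           .option("start-snapshot-id", "98765")           # placeholder snapshot IDs
           .option("end-snapshot-id", "98770")
           .load("my_catalog.iot_db.iot_sensor_data"))
```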
Are we duplicating data? We have the same information
stored in both a Parquet file and an S3 table.

🤔 Are We Duplicating Data When We Use S3 Tables Like Iceberg?
No — you're not duplicating the actual data. You're layering metadata on top of your
Parquet files.

🧱 What Gets Stored?


| Component          | Stored Where                   | What It Contains                                          |
|--------------------|--------------------------------|-----------------------------------------------------------|
| Parquet files      | In S3                          | The actual columnar data (sensor readings)                |
| Iceberg metadata   | Also in S3 (as metadata files) | Table schema, partition info, snapshot history, versioning |
| Glue Catalog entry | In Glue (or Hive)              | A pointer to the Iceberg table and schema (like a registry) |

Parquet files are reused, not copied or rewritten unless you perform updates.

📦 So What Is a “Table” in Iceberg or S3 Tables?


A table in Iceberg (or Hudi/Delta) is really:

A smart metadata layer that tracks, manages, and optimizes access to your existing
Parquet files.
✅ Example: Without vs With Iceberg Table

📁 Raw Parquet Storage Only

You dump files here:

s3://iot-data/sensor_data/2025/06/26/part-001.parquet

You manually query via Athena:

SELECT * FROM "iot_db"."sensor_data"

But:

• No versioning
• No upserts
• No schema tracking

🧠 With Iceberg Table

Iceberg creates these additional files:

s3://iot-data/tables/iot_sensor_data/
├── metadata/
│ ├── v1.metadata.json
│ ├── snapshots/
│ └── manifest-lists/
└── data/
├── part-001.parquet
├── part-002.parquet

Now you can:

• Track schema changes


• Optimize queries
• Do ACID updates
• Use Spark, Trino, Flink to write/manage cleanly
The same Parquet files are being used — they’re just managed intelligently now.

🚫 No Duplication, But…
| Operation                      | Causes Data Duplication? | Why?                                          |
|--------------------------------|--------------------------|-----------------------------------------------|
| Just creating the Iceberg table| No                       | Metadata layer only                           |
| Insert new data                | No                       | Writes Parquet once, tracks it in metadata    |
| Update/delete data             | Maybe (controlled)       | Old files are rewritten, but versions are tracked |
| Compaction                     | Maybe (optional)         | May rewrite files to optimize performance     |

But these are intentional, optimized operations, not mindless duplication.

🧠 Are We Storing Twice?


| Concern                          | Answer                                            |
|----------------------------------|---------------------------------------------------|
| Is data stored twice?            | No — the same Parquet is reused                   |
| Are Parquet files duplicated?    | Not unless you update/compact                     |
| Is metadata stored separately?   | Yes — lightweight, versioned JSON                 |
| Are you paying twice for storage?| No — only for the actual Parquet plus tiny metadata |

✅ Final Verdict
Iceberg, Hudi, and Delta do not duplicate your sensor data in S3.
They add a smart, versioned table layer on top of your existing Parquet files — unlocking
powerful features without duplicating storage.
🧱 What’s Actually Inside an S3 Table?
When you create a table using Apache Iceberg or other table formats on top of S3 Parquet
files, you’re not just storing Parquet data — you’re storing a rich metadata structure
alongside it.

📁 Iceberg Table Folder Structure in S3

s3://your-bucket/iot-sensor-table/
├── data/
│ ├── part-00000.parquet
│ ├── part-00001.parquet
│ └── ...
├── metadata/
│ ├── v1.metadata.json
│ ├── v2.metadata.json
│ ├── snapshots/
│ │ ├── snapshot-001.json
│ │ ├── snapshot-002.json
│ │ └── ...
│ ├── manifest-lists/
│ ├── manifests/
│ └── ...

📦 Breakdown of Each Folder

1. /data/

• Contains your actual Parquet files


• IoT sensor records live here (e.g., temperature, speed, location)

These are your facts — your raw and transformed data.


2. /metadata/

This is what turns a folder of files into a real table.

| File/Folder       | Purpose                                                                |
|-------------------|------------------------------------------------------------------------|
| v1.metadata.json  | Top-level Iceberg table metadata (schema, partition spec, format, etc.)|
| snapshots/        | Every snapshot = a version of the table (think Git commit)             |
| manifest-lists/   | List of manifest files (groups of data files) for a given snapshot     |
| manifests/        | Lists of actual Parquet files and their stats (min/max per column)     |

Iceberg tracks every file added, removed, and every schema change — through
snapshots and manifests.

🔍 Example Inside v2.metadata.json

{
"format-version": 2,
"table-uuid": "abc-123",
"schema": {
"fields": [
{"id": 1, "name": "truck_id", "type": "string"},
{"id": 2, "name": "timestamp", "type": "timestamp"},
{"id": 3, "name": "speed", "type": "double"}
]
},
"partition-spec": [{"field-id": 2, "transform": "day"}],
"current-snapshot-id": 98765,
"snapshots": [...],
...
}
🕘 Time Travel Example
Each snapshot-xxx.json file contains:

• A snapshot ID
• Timestamp of commit
• List of manifests used in that snapshot

So if your sensor data changed on 2025-06-25, you can query the table "as it was"
before that change.

✅ Summary: What’s Inside an Iceberg (S3) Table?


| Component              | Stored in S3 as...            | What It Contains                          |
|------------------------|-------------------------------|-------------------------------------------|
| Actual data            | /data/*.parquet               | Sensor data rows                          |
| Schema & structure     | metadata/vN.metadata.json     | Column names/types, partitioning          |
| Snapshots              | metadata/snapshots/*.json     | Versions of the table                     |
| File listing           | metadata/manifests/*.avro     | Which files are active in each snapshot   |
| Stats for optimization | manifest-lists                | Min/max values, row counts, per file      |

🧠 Why This Matters


• Enables ACID transactions on object storage (S3)
• Tracks every change = audit trail
• Supports fast SQL queries with min/max file skipping
• Allows schema evolution safely
• Enables incremental queries for streaming use cases
Do we also need to prioritize having the Glue Catalog as a layer on top of everything?

🧱 What Is AWS Glue Catalog?


The Glue Data Catalog is a centralized metadata store (like a database of table
definitions).

It stores information about:

• Table names
• Schemas
• Partition columns
• Table location (S3 path)
• Format (Iceberg, Parquet, Delta, CSV, etc.)
• Column types, descriptions, and more

Think of it as:

The "registry" that helps tools know what data you have, where it is, and how to query
it.

📍 Where Does Glue Catalog Fit in Your Pipeline?

Your Architecture:

IoT Data (Kafka → Spark → Iceberg on S3)
        ▼
Iceberg Table (S3-based)
        ▼
Glue Catalog (table metadata registration)
        ▼
Athena / Redshift / Trino / SageMaker / EMR
✅ Why You Need Glue Catalog on Top of Iceberg or Parquet

| Benefit                          | Why It Matters                                                |
|----------------------------------|---------------------------------------------------------------|
| Query with Athena                | Athena uses Glue to discover tables and schemas               |
| Redshift Spectrum integration    | Redshift queries data in S3 via Glue                          |
| SageMaker Feature Store          | Can ingest directly from Glue-registered tables               |
| EMR + Hive support               | Spark/Hive engines read the Glue Catalog for schema and partitions |
| Data Governance (Lake Formation) | Access control is applied via Glue table metadata             |
| BI Tool Integration              | Tableau, PowerBI, Looker use Glue as a metadata discovery layer |

🔍 Example: Registering an Iceberg Table in Glue


When you create an Iceberg table, you can register it like:

CREATE TABLE iot_db.sensor_data (
  truck_id string,
  timestamp timestamp,
  speed double,
  fuel_level int
)
PARTITIONED BY (day(timestamp))
STORED AS ICEBERG
LOCATION 's3://your-bucket/iot-sensor-data/'
TBLPROPERTIES ('table_type'='ICEBERG');

This table now:

• Points to S3 location
• Appears in Glue Catalog under iot_db.sensor_data
• Can be queried with Athena:
SELECT * FROM iot_db.sensor_data
WHERE truck_id = 'TRUCK123'
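On the Spark side, a job can point at the same Glue-backed Iceberg catalog through configuration; a minimal sketch using Iceberg's Glue catalog properties, with placeholder catalog name, bucket, and query:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("iceberg-glue")
         # Register an Iceberg catalog named "glue_catalog" backed by AWS Glue
         .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/iot-sensor-data/")
         .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
         .getOrCreate())

# The table registered in Glue is now addressable from Spark SQL
spark.sql("""
    SELECT truck_id, avg(speed) AS avg_speed
    FROM glue_catalog.iot_db.sensor_data
    WHERE date(`timestamp`) = current_date()
    GROUP BY truck_id
""").show()
```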

☁️ Glue Catalog Makes Iceberg Table “Cloud-Native”


| Without Glue Catalog      | With Glue Catalog                       |
|---------------------------|------------------------------------------|
| Manual access to S3 paths | One-click SQL access in Athena           |
| No governance             | Lake Formation + IAM control enabled     |
| Limited discovery         | Fully searchable schema registry         |
| No BI integration         | Integrates with all AWS analytics tools  |

✅ Do You Need Glue Catalog on Top of Iceberg Tables?


| Scenario                                  | Glue Catalog Needed?       |
|-------------------------------------------|----------------------------|
| Query from Athena or Redshift             | Yes                        |
| Want a central schema/partition registry  | Yes                        |
| Use SageMaker, EMR, Lake Formation        | Yes                        |
| Just Spark or Flink with a custom catalog | Optional (but still useful)|

Yes — Glue Catalog is a critical layer for discoverability, governance, and cross-service integration.

🔧 Essentials
| Layer        | Tool/Tech    |
|--------------|--------------|
| Ingestion    | Kafka + Avro |
| Processing   | Spark        |
| Storage      | S3 + Parquet |
| Table Format | Iceberg      |
| Metadata     | Glue Catalog |

🧠 Final Thoughts — Architecture Summary


IoT Sensors ──▶ Kafka (Avro) ──▶ Schema Registry
                     │
                     ▼
              Spark / Flink
                     │
                     ▼
    Iceberg Tables on S3 (Parquet)
                     │
     ┌───────────┬───┴─────────┬──────────────┐
     ▼           ▼             ▼              ▼
   Glue      Lake Formation  Feature Store  Trino/Athena
  Catalog   (Access Control) (ML Features)   (SQL/BI)

⚙️ Technologies Used (with Analogies)

| Technology                            | Role in Pipeline                                                  | Analogy                                        |
|---------------------------------------|-------------------------------------------------------------------|------------------------------------------------|
| Kafka + Avro                          | Real-time ingestion + compact serialization                       | Post office + compressed envelope              |
| Schema Registry                       | Central schema store, versioning                                  | Contract manager ensuring everyone agrees      |
| Spark / Flink                         | Stream/batch processing and transformation                        | Factory conveyor belt (transform & enrich)     |
| Parquet                               | Columnar storage format                                           | Excel sheet zipped by column for fast access   |
| Apache Iceberg                        | Table abstraction on Parquet (ACID, time travel, schema evolution)| Warehouse manager with a full ledger           |
| Object Storage (S3/GCS/Azure Blob)    | Scalable, cheap storage                                           | A giant, organized storage room                |
| Metadata Catalog (Glue/Hive/Nessie)   | Metadata registry                                                 | Phonebook that lets tools find the right table |
| Athena / Trino                        | SQL query layer over Iceberg tables                               | Google search bar for your warehouse           |
| Lakehouse Governance Tools            | Governance & access control                                       | Security and gatekeeping system                |

📘 Beginner-Friendly Intent

This document was created with the intention to help beginners — like myself — learn the
basics of data engineering in a simple, analogy-driven way.

It walks through real-time pipelines step-by-step:

• Explains why each technology is used (not just what it is)


• Uses real-world analogies (factories, packages, water pipes)
• Shows when and how to apply tools like Kafka, Spark, Parquet, and Iceberg

Whether you’re a developer, analyst, or new engineer exploring data systems, this guide
provides a single, simplified reference for building modern, scalable pipelines.

Author Note

Author: Soumyadipta De

Role: Technical Architect / Project Manager with 23+ years of experience in Application
Development, Serverless, Microservices, AWS, Kubernetes, and DevOps.

LinkedIn: Soumyadipta De
