Batch and Stream Processing in Confluent Cloud for Apache Flink¶
Confluent Cloud for Apache Flink® supports both batch and stream processing, which enables you to process data in either finite (bounded) or infinite (unbounded) modes. Understanding the differences between these modes is crucial for designing efficient data pipelines and analytics solutions.
Overview¶
Flink is a distributed processing engine that excels at both batch and stream processing. While both modes share the same underlying engine and APIs, they have distinct characteristics, optimizations, and use cases.
In Confluent Cloud for Apache Flink, batch mode is available by using snapshot queries.
Batch processing¶
Batch processing in Flink operates on bounded datasets, which are finite, static collections of data. This processing mode has the following key characteristics.
- It processes complete, finite datasets, like files or database snapshots.
- Batch jobs run to completion and then terminate.
- It is optimized for throughput, focusing on processing large volumes of data efficiently.
- Batch processing can sort, aggregate, and join across the entire dataset.
- The system can drop state as soon as it is no longer needed.
- Use cases: - Historical data analysis - ETL (Extract, Transform, Load) operations - Report generation - Data warehousing
Stream processing¶
Stream processing in Flink handles unbounded data streams, which have data that arrives continuously and might never end. This processing mode has the following key characteristics.
- It processes infinite, continuous data streams, such as Kafka topics or sensor feeds.
- Stream processing jobs run indefinitely, processing data as it arrives.
- It focuses on processing data with minimal delay for low latency.
- It produces incremental results as new data arrives.
- The system must retain state to handle late or out-of-order events.
- Use cases: - Real-time analytics - Fraud detection - IoT data processing - Live dashboards
Bounded and unbounded tables comparison¶
In Flink, tables can be either bounded (batch) or unbounded (streaming). The following table compares the key differences between these two modes.
Aspect | Bounded Mode (Batch) | Unbounded Mode (Streaming) |
---|---|---|
Data Size | Finite (static) | Infinite (dynamic, continuous) |
Processing Style | Batch processing | Real-time/continuous processing |
Query Semantics | All data available at once | Data arrives over time |
State Management | Minimal, can drop state when done | Must retain state for late/out-of-order data |
Use Cases | ETL, reporting, historical analytics | Real-time analytics, monitoring, alerting |
Differences between batch and stream processing¶
The following table compares the important differences between batch and stream processing.
Aspect | Batch Processing | Stream Processing |
---|---|---|
Data Model | Processes complete, finite datasets. | Processes infinite, continuous data streams. |
Execution Model | Jobs run to completion. | Jobs run continuously. |
Latency vs. Throughput | Optimized for high throughput. | Optimized for low latency. |
State Management | Minimal state, which is dropped when no longer needed. | Robust state, which is retained for late or out-of-order data. |
Fault Tolerance | Can restart from the beginning. | Requires checkpointing for fault recovery. |
Query Semantics | All data is available at once, so global operations are possible. | Data arrives over time, so results are incremental. |
SQL/API Differences |
|
|
Unified processing model¶
One important advantage of Flink is its unified processing model. This means that the same runtime engine handles both batch and streaming. The engine treats batch processing as a special case of stream processing. The same APIs and operators work for both modes. You can use the same code for both batch and streaming applications.
This unified approach enables you to:
- Build applications that process both historical and real-time data.
- Seamlessly transition between batch and streaming modes.
- Maintain consistent semantics across processing modes.
- Leverage the same tools and libraries for both paradigms.
Time and watermarks¶
Time and watermarks are important concepts in Flink that help you process data correctly.
- Batch mode: Time is fixed. All data is available, so event time and processing time are equivalent.
- Streaming mode: Time is dynamic. Streaming mode uses watermarks to track event time progress and handle out-of-order data.
- Windowing: In streaming, you use windows (tumbling, hopping, cumulative, session) to group data for aggregation. In batch, windows apply to static data.
For more information, see Time and Watermarks.
Determinism¶
Determinism is a key concept in Flink that helps you ensure that your queries always produce the same results.
- Batch: Re-running a batch job on the same data yields the same result, except for non-deterministic functions like UUID().
- Streaming: Results can vary due to timing, order of arrival, and late data. Determinism is harder to guarantee.
For more information, see Determinism in Continuous Queries.
Snapshot queries and batch mode¶
In Confluent Cloud for Apache Flink, batch mode is available by using snapshot queries.
- Snapshot queries: These are batch queries that automatically bound the input sources as of the current time.
- Batch optimizations: Batch mode enables optimizations like global sorting, blocking operators, and efficient joins. Snapshot queries benefit from these optimizations.
- Resource usage: Batch jobs, which are snapshot queries in Confluent Cloud for Apache Flink, release resources when finished. Streaming jobs hold resources as long as they run.
For more information, see Snapshot Queries.
Examples¶
The following code example shows a batch query.
-- Count all orders in a bounded table
SELECT COUNT(*) FROM orders;
The following code example shows a streaming query.
-- Count orders per minute in an unbounded stream.
SELECT window_start, window_end, COUNT(*)
FROM TABLE(
TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;