Flink Custom Windows: Flink from Getting Started to Giving Up (9), Understanding the window & time Concepts

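This article introduces the Window concept in Flink in detail, covering time-based and count-based windows. It discusses the role windows play in processing unbounded streams and compares ProcessingTime, EventTime, and IngestionTime.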

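We have now reached windows. First, a recap of how the introductory post defined a window:

Definition of Window

window: a mechanism that carves a finite set out of an unbounded stream so that operations can run on a bounded data set. Windows can be time-based or count-based. KeyedStream → WindowedStream (note that the stream type changes): windows can be defined on an already-partitioned KeyedStream. Windows group the data in each key according to some characteristic (for example, the data that arrived within the last 5 seconds). For a complete description of windows, see windows: https://flink.xskoo.com/dev/stream/operators/windows.html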

    dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
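The time-based case is shown above; the count-based case can be pictured with a small plain-Java sketch. This is not the Flink API: the class `CountWindowSketch` and its `add` method are hypothetical names used only to illustrate a keyed window that closes after a fixed number of elements.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of a keyed, count-based tumbling window (not the Flink API). */
final class CountWindowSketch {
    private final int size;                                   // elements per window
    private final Map<String, List<Integer>> buffers = new HashMap<>();

    CountWindowSketch(int size) {
        this.size = size;
    }

    /** Adds a value for a key; returns the window sum once the window is full, else null. */
    Integer add(String key, int value) {
        List<Integer> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() < size) {
            return null;                                      // window for this key is still open
        }
        int sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        buffer.clear();                                       // start the next window for this key
        return sum;
    }

    public static void main(String[] args) {
        CountWindowSketch windows = new CountWindowSketch(3);
        windows.add("a", 1);
        windows.add("b", 10);                                 // a different key has its own buffer
        windows.add("a", 2);
        System.out.println(windows.add("a", 3));              // third element for "a" closes its window
    }
}
```

Each key keeps its own buffer, so elements of key "b" never count toward the window of key "a".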

tumbling time windows: windows never overlap, so an element belongs to exactly one window
sliding time windows: windows overlap, so an element can belong to several windows

    data.keyBy(1)
        .timeWindow(Time.minutes(1)) // tumbling time window: sum the counts once per minute
        .sum(1);

    data.keyBy(1)
        .timeWindow(Time.minutes(1), Time.seconds(30)) // sliding time window: every 30 s, sum the counts of the past minute
        .sum(1);
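To see why a tumbling window holds each element once while a sliding window can hold it several times, here is a minimal plain-Java sketch of window assignment. It assumes windows are aligned to epoch 0 with no offset; `WindowAssignSketch`, `tumblingStart`, and `slidingStarts` are illustrative names, not Flink classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of assigning a timestamp to tumbling and sliding windows. */
final class WindowAssignSketch {

    /** Start of the single tumbling window [start, start + size) containing the timestamp. */
    static long tumblingStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Starts of every sliding window [start, start + size) that contains the timestamp. */
    static List<Long> slidingStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 70_000L;                                            // 1 min 10 s after epoch
        System.out.println(tumblingStart(ts, 60_000L));               // 60000
        System.out.println(slidingStarts(ts, 60_000L, 30_000L));      // [60000, 30000]
    }
}
```

With a one-minute size and a 30-second slide, every timestamp falls into two windows, which matches the "every 30 s, look at the past minute" behaviour in the snippet above.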

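timeWindow: as described above, aggregates stream data by time. For example, a one-minute tumbling time window collects one minute of elements and, once the minute has passed, applies a function to all elements in the window.
windowAll: DataStream → AllWindowedStream. windowAll can be applied directly to non-partitioned data, that is, windows can be defined on a regular DataStream. Windows group all stream events according to some characteristic (for example, the data that arrived within the last 5 seconds). For a complete description of windows, see windows: https://flink.xskoo.com/dev/stream/operators/windows.html

**WARNING:** In many cases this is a **non-parallel** transformation. All records will be gathered in a single task of the windowAll operator.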

    dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data

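The official documentation says the following about windows:

Windows are at the heart of processing infinite streams. Windows split the stream into "buckets" of finite size, over which we can apply computations. This document focuses on how windowing is performed in Flink and how the programmer can benefit the most from the functionality it offers.

The general structure of a windowed Flink program is shown below. The first snippet refers to keyed streams, the second to non-keyed streams. As one can see, the only difference is that window(...) is applied to the KeyedStream produced by keyBy, while windowAll(...) is applied to a non-keyed stream.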

Keyed Windows

    stream
        .keyBy(...)
        .window(...)

Non-Keyed Windows

    stream
        .windowAll(...)

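In the full skeleton from the official documentation (abbreviated above), the commands in square brackets ([...]) are optional. This shows that Flink lets you customize the windowing logic in many different ways so that it best fits your needs.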
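The difference between keyed and non-keyed windows can be made concrete with a plain-Java sketch that sums the same events both ways. Nothing here is Flink API: `KeyedVsAllSketch`, `Event`, `keyedSums`, and `allSums` are hypothetical names, and the windows are assumed to be tumbling and aligned to epoch 0.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration: the same events grouped per key and window vs. per window only. */
final class KeyedVsAllSketch {

    static final class Event {
        final String key;
        final long timestampMs;
        final long value;

        Event(String key, long timestampMs, long value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Keyed windows: one bucket per (key, window start), labelled "key@start". */
    static Map<String, Long> keyedSums(List<Event> events, long sizeMs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Event e : events) {
            String bucket = e.key + "@" + windowStart(e.timestampMs, sizeMs);
            sums.merge(bucket, e.value, Long::sum);
        }
        return sums;
    }

    /** Non-keyed windows: one bucket per window start, shared by all keys. */
    static Map<Long, Long> allSums(List<Event> events, long sizeMs) {
        Map<Long, Long> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(windowStart(e.timestampMs, sizeMs), e.value, Long::sum);
        }
        return sums;
    }
}
```

In the keyed case the buckets for different keys are independent, which is what allows them to be processed in parallel; in the non-keyed case every event lands in the same sequence of buckets, which is why windowAll tends to run as a single task.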

window VS timeWindow

KeyedStream in Flink also has a timeWindow method, which is a wrapper around window. Here is its implementation:

    /**
     * Windows this {@code KeyedStream} into tumbling time windows.
     *
     * <p>This is a shortcut for either {@code .window(TumblingEventTimeWindows.of(size))} or
     * {@code .window(TumblingProcessingTimeWindows.of(size))} depending on the time characteristic
     * set using
     * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
     *
     * @param size The size of the window.
     */
    public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
        if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
            return window(TumblingProcessingTimeWindows.of(size));
        } else {
            return window(TumblingEventTimeWindows.of(size));
        }
    }

    /**
     * Windows this {@code KeyedStream} into sliding time windows.
     *
     * <p>This is a shortcut for either {@code .window(SlidingEventTimeWindows.of(size, slide))} or
     * {@code .window(SlidingProcessingTimeWindows.of(size, slide))} depending on the time
     * characteristic set using
     * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
     *
     * @param size The size of the window.
     */
    public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide) {
        if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
            return window(SlidingProcessingTimeWindows.of(size, slide));
        } else {
            return window(SlidingEventTimeWindows.of(size, slide));
        }
    }

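As you can see, both tumbling and sliding time windows are built on the window method, so in most cases timeWindow can be used directly in place of window.

Time Types in Flink

The code above mentions Processing Time and Event Time; there is also a third kind, Ingestion Time. The explanations below are taken from the Flink website and from the article on the kinds of Time in Flink in "Learning Flink from 0 to 1" (《从0到1学习Flink》, http://www.54tianzhisheng.cn/):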

Processing Time:

Processing time refers to the system time of the machine that is executing the respective operation.
When a streaming program runs on processing time, all time-based operations (like time windows) will use the system clock of the machines that run the respective operator. An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
Processing time is the simplest notion of time and requires no coordination between streams and machines. It provides the best performance and the lowest latency. However, in distributed and asynchronous environments processing time does not provide determinism, because it is susceptible to the speed at which records arrive in the system (for example from the message queue), to the speed at which the records flow between operators inside the system, and to outages (scheduled, or otherwise).

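Put simply, processing time is the time at which a record is processed on the Flink cluster; it follows the cluster's current clock.

As noted above, however, in distributed and asynchronous settings processing time cannot guarantee that the time a record is processed matches the time the event actually happened: messages from an MQ may arrive out of order or only after retries, and with parallel processing inside Flink, differences in thread speed can let a record that was sent later be processed first.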
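The 9:15am example follows from windows being aligned to the clock rather than to the moment the job starts. A minimal plain-Java sketch of that alignment, using only `java.time` (the class `HourlyProcessingWindowSketch` and its methods are illustrative names, not Flink API):

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

/** Plain-Java illustration of clock-aligned hourly processing-time windows. */
final class HourlyProcessingWindowSketch {

    /** Start of the hourly window that contains the given wall-clock time. */
    static LocalDateTime windowStart(LocalDateTime wallClock) {
        return wallClock.truncatedTo(ChronoUnit.HOURS);
    }

    /** Exclusive end of the hourly window that contains the given wall-clock time. */
    static LocalDateTime windowEnd(LocalDateTime wallClock) {
        return windowStart(wallClock).plusHours(1);
    }

    public static void main(String[] args) {
        LocalDateTime jobStart = LocalDateTime.of(2024, 1, 1, 9, 15);   // arbitrary example date
        // The window is [09:00, 10:00), but the job only sees records from 09:15 onwards.
        System.out.println(windowStart(jobStart) + " .. " + windowEnd(jobStart));
    }
}
```

The first window is therefore only partially filled, and every later window covers a full clock hour.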

Event Time:

Event time is the time that each individual event occurred on its producing device. This time is typically embedded within the records before they enter Flink, and that event timestamp can be extracted from each record. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time. This watermarking mechanism is described in a later section, https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html#event-time-and-watermarks.
In a perfect world, event time processing would yield completely consistent and deterministic results, regardless of when events arrive, or their ordering. However, unless the events are known to arrive in-order (by timestamp), event time processing incurs some latency while waiting for out-of-order events. As it is only possible to wait for a finite period of time, this places a limit on how deterministic event time applications can be.
Assuming all of the data has arrived, event time operations will behave as expected, and produce correct and consistent results even when working with out-of-order or late events, or when reprocessing historic data. For example, an hourly event time window will contain all records that carry an event timestamp that falls into that hour, regardless of the order in which they arrive, or when they are processed. (See the section on https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html#late-elements for more information.)
Note that sometimes when event time programs are processing live data in real-time, they will use some processing time operations in order to guarantee that they are progressing in a timely fashion.

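Put simply, event time is the time at which the data was produced on its own business server; it has nothing to do with Flink.

However, because data may arrive out of order under event time, Flink has to wait for a while during processing so that out-of-order events can still be included, and this introduces latency.

In addition, because event-time processing is independent of Flink's own clock, a watermark has to be specified to indicate how far processing has progressed in event time.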
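How a watermark trades latency for completeness can be sketched in plain Java. The following is a simplified simulation under stated assumptions, not Flink code: the watermark is taken to be the largest timestamp seen so far minus a fixed tolerance, a tumbling window fires once the watermark reaches its last timestamp (end minus 1), and records whose window has already fired are simply reported as late. `WatermarkSketch` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of event-time tumbling windows driven by a simple watermark. */
final class WatermarkSketch {

    /**
     * Replays event timestamps in arrival order and returns a log of what happened:
     * "fire [start,end) count=n" when a window closes, "late t" for a record that missed its window.
     */
    static List<String> run(long[] arrivals, long sizeMs, long maxOutOfOrdernessMs) {
        List<String> log = new ArrayList<>();
        TreeMap<Long, Integer> open = new TreeMap<>();       // window start -> element count
        long maxSeen = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;

        for (long ts : arrivals) {
            long start = ts - Math.floorMod(ts, sizeMs);
            if (start + sizeMs - 1 <= watermark) {
                log.add("late " + ts);                       // the window for this record already fired
            } else {
                open.merge(start, 1, Integer::sum);
            }

            // Assumed watermark rule: largest timestamp seen so far minus the tolerance.
            maxSeen = Math.max(maxSeen, ts);
            watermark = maxSeen - maxOutOfOrdernessMs;

            // Fire every open window whose last timestamp is covered by the watermark.
            while (!open.isEmpty() && open.firstKey() + sizeMs - 1 <= watermark) {
                Map.Entry<Long, Integer> w = open.pollFirstEntry();
                log.add("fire [" + w.getKey() + "," + (w.getKey() + sizeMs) + ") count=" + w.getValue());
            }
        }
        return log;
    }

    public static void main(String[] args) {
        long[] arrivals = {1000, 3000, 2000, 6000, 4000, 7500, 4500};
        // 5 s windows, tolerate records up to 2 s out of order.
        for (String line : run(arrivals, 5000L, 2000L)) {
            System.out.println(line);
        }
    }
}
```

In this run the record with timestamp 4000 arrives after 6000 yet is still counted, because the watermark has only reached 4000 at that point; the record with timestamp 4500 arrives after the watermark has passed 4999 and is reported as late. The second window stays open because the simulated stream ends before the watermark reaches it.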

Ingestion Time:

Ingestion time is the time that events enter Flink. At the source operator each record gets the source's current time as a timestamp, and time-based operations (like time windows) refer to that timestamp.
Ingestion time sits conceptually in between event time and processing time. Compared to processing time, it is slightly more expensive, but gives more predictable results. Because ingestion time uses stable timestamps (assigned once at the source), different window operations over the records will refer to the same timestamp, whereas in processing time each window operator may assign the record to a different window (based on the local system clock and any transport delay).
Compared to event time, ingestion time programs cannot handle any out-of-order events or late data, but the programs don't have to specify how to generate watermarks.
Internally, ingestion time is treated much like event time, but with automatic timestamp assignment and automatic watermark generation.

In short, ingestion time sits between processing time and event time: it is the time at which data enters Flink.
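The "stable timestamp" point can be illustrated with a few lines of plain Java. The numbers are made up for the example, and `IngestionVsProcessingSketch` is an illustrative name, not Flink API: one record is stamped once at the source, then handled by two window operators whose local clocks read slightly different times.

```java
/** Plain-Java illustration of why a timestamp assigned once at the source is more predictable. */
final class IngestionVsProcessingSketch {

    /** Start of the tumbling window (aligned to epoch 0) containing the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    public static void main(String[] args) {
        long sizeMs = 5_000L;
        long ingestedAt = 4_990L;        // assumed: stamped once when the record entered the source
        long clockAtOperatorA = 4_995L;  // assumed: local clock when operator A handles the record
        long clockAtOperatorB = 5_003L;  // assumed: local clock when operator B handles the record

        // Processing time: each operator consults its own clock, so the windows can differ.
        System.out.println(windowStart(clockAtOperatorA, sizeMs));   // 0
        System.out.println(windowStart(clockAtOperatorB, sizeMs));   // 5000

        // Ingestion time: both operators use the source timestamp, so the window is the same.
        System.out.println(windowStart(ingestedAt, sizeMs));         // 0
    }
}
```

A transport delay of only a few milliseconds is enough to move the record across a window boundary under processing time, while the ingestion timestamp keeps it in the same window at every operator.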