#### Official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.12/zh/dev/table/streaming/match_recognize.html
#### Scenario: we have real-time temperature monitoring data from our machine rooms, and an excessively high temperature triggers an alarm. For each rack we want to know the start and end time of every alarm, the average temperature during the alarm, and related metrics.
#### Import the dependency (flink-csv provides the CSV format; the core Flink Table API, planner, and Scala dependencies are assumed to already be in the project):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-csv</artifactId>
    <version>${flink.version}</version>
</dependency>
#### Data preparation: create a temp_record.csv data file with the contents below. The columns are, in order: rack ID, timestamp, and temperature.
1,2021-09-14 15:37:03.3,30
1,2021-09-14 15:37:13.3,50
1,2021-09-14 15:37:23.3,55
1,2021-09-14 15:37:33.3,60
1,2021-09-14 15:37:43.3,55
1,2021-09-14 15:37:53.3,50
1,2021-09-14 15:38:03.3,45
2,2021-09-14 15:37:03.3,30
2,2021-09-14 15:37:13.3,50
2,2021-09-14 15:37:23.3,55
2,2021-09-14 15:37:33.3,45
2,2021-09-14 15:37:43.3,55
2,2021-09-14 15:37:53.3,50
2,2021-09-14 15:38:03.3,45
Note: the CSV file must end with an empty line, otherwise parsing fails with an error.
#### Next, based on the schema of the sample data, write a CREATE TABLE statement and check that the data can be read correctly.
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

// Source table backed by the CSV file; the watermark makes ts an event-time attribute.
val createTableSql =
  """
    |create table temp_record (
    |  rack_id int,
    |  ts timestamp(3),
    |  temp int,
    |  WATERMARK FOR ts AS ts - INTERVAL '1' SECOND
    |) with (
    |  'connector'='filesystem',
    |  'path'='/path/to/temp_record.csv',
    |  'format'='csv'
    |)
    |""".stripMargin
tEnv.executeSql(createTableSql)

// Quick check that the file can be read.
tEnv.executeSql("select * from temp_record").print()
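Related to the parse-error note above: if you prefer the job to skip malformed CSV lines instead of failing, the CSV format exposes a csv.ignore-parse-errors option that can be added to the WITH clause. The sketch below registers a second table under a made-up name (temp_record_tolerant) with that option enabled; verify the behavior in your environment before relying on it.

// Variant of the DDL above that tolerates unparsable lines instead of failing the job.
val tolerantTableSql =
  """
    |create table temp_record_tolerant (
    |  rack_id int,
    |  ts timestamp(3),
    |  temp int,
    |  WATERMARK FOR ts AS ts - INTERVAL '1' SECOND
    |) with (
    |  'connector'='filesystem',
    |  'path'='/path/to/temp_record.csv',
    |  'format'='csv',
    |  'csv.ignore-parse-errors'='true'
    |)
    |""".stripMargin
tEnv.executeSql(tolerantTableSql)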
#### If everything is set up correctly, the console prints the contents of the CSV file, as shown below:
+----+-------------+-------------------------+-------------+
| op | rack_id | ts | temp |
+----+-------------+-------------------------+-------------+
| +I | 1 | 2021-09-14 15:37:53.300 | 50 |
| +I | 1 | 2021-09-14 15:37:23.300 | 55 |
| +I | 1 | 2021-09-14 15:37:43.300 | 55 |
| +I | 2 | 2021-09-14 15:37:43.300 | 55 |
| +I | 1 | 2021-09-14 15:38:03.300 | 45 |
| +I | 2 | 2021-09-14 15:37:03.300 | 30 |
| +I | 1 | 2021-09-14 15:37:03.300 | 30 |
| +I | 1 | 2021-09-14 15:37:13.300 | 50 |
| +I | 2 | 2021-09-14 15:37:33.300 | 45 |
| +I | 2 | 2021-09-14 15:37:13.300 | 50 |
| +I | 1 | 2021-09-14 15:37:33.300 | 60 |
| +I | 2 | 2021-09-14 15:37:53.300 | 50 |
| +I | 2 | 2021-09-14 15:38:03.300 | 45 |
| +I | 2 | 2021-09-14 15:37:23.300 | 55 |
+----+-------------+-------------------------+-------------+
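Note that the rows are not printed in file order: the filesystem source is read by several parallel subtasks, so their results interleave. This does not affect the MATCH_RECOGNIZE query below, which sorts by the event-time attribute ts within each partition, but if you want the debug output to roughly follow file order, a minimal sketch (same setup as above) is to force parallelism 1 before creating the table environment:

val env = StreamExecutionEnvironment.getExecutionEnvironment
// A single subtask reads the file sequentially, so rows print roughly in file order.
env.setParallelism(1)
val tEnv = StreamTableEnvironment.create(env)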
#### Writing the business logic
In this example we assume that an alarm means a temperature greater than or equal to 50 degrees. For each rack we need to query the start time of the high-temperature alarm, its end time, the temperature at the start, the temperature at the end, and the average temperature during the alarm. Based on these requirements, we write the following SQL.
val cepSql =
  """
    |select * from temp_record
    |match_recognize(
    |  partition by rack_id
    |  order by ts
    |  measures
    |    A.ts as start_ts,
    |    last(B.ts) as end_ts,
    |    A.temp as start_temp,
    |    last(B.temp) as end_temp,
    |    avg(B.temp) as avg_temp
    |  one row per match
    |  after match skip to next row
    |  pattern (A B+ C) within interval '90' second
    |  define
    |    A as A.temp < 50,
    |    B as B.temp >= 50,
    |    C as C.temp < 50
    |)
    |""".stripMargin
tEnv.executeSql(cepSql).print()
#### Execution result:
+----+-------------+-------------------------+-------------------------+-------------+-------------+-------------+
| op | rack_id | start_ts | end_ts | start_temp | end_temp | avg_temp |
+----+-------------+-------------------------+-------------------------+-------------+-------------+-------------+
| +I | 2 | 2021-09-14 15:37:03.300 | 2021-09-14 15:37:23.300 | 30 | 55 | 52 |
| +I | 2 | 2021-09-14 15:37:33.300 | 2021-09-14 15:37:53.300 | 45 | 50 | 52 |
| +I | 1 | 2021-09-14 15:37:03.300 | 2021-09-14 15:37:53.300 | 30 | 50 | 54 |
+----+-------------+-------------------------+-------------------------+-------------+-------------+-------------+
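As a sanity check against the source data: for rack 1 there is a single match whose B rows have temperatures 50, 55, 60, 55, 50, so avg_temp = (50 + 55 + 60 + 55 + 50) / 5 = 54, which matches the output. For rack 2 both matches average (50 + 55) / 2 = 52.5 and (55 + 50) / 2 = 52.5, printed as 52 because temp is an INT column, so the average is computed with integer precision.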
Explanation of the SQL:
- We need to compute results for each rack_id separately, hence `partition by rack_id`.
- In the measures clause, `last(B.ts) as end_ts` takes the last row mapped to pattern variable B. From the define clause, B matches rows with a temperature greater than or equal to 50, so end_ts is the timestamp of the last high-temperature reading during the alarm (and `last(B.temp)` is its temperature).
- The partition by clause and the measures clause together determine the schema of the output table.
- The pattern clause and the define clause together define the match template: one row with a temperature below 50, followed by one or more rows with temperatures of at least 50, followed by another row with a temperature below 50.
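To make the point about the output schema concrete: the result columns are exactly the partition key plus whatever the measures clause lists. Below is a sketch of the same query with two extra measures, `max(B.temp)` for the hottest reading during the alarm and `C.ts` for the time the temperature drops back below 50; the names extendedCepSql, max_temp, and recover_ts are made up for illustration.

val extendedCepSql =
  """
    |select * from temp_record
    |match_recognize(
    |  partition by rack_id
    |  order by ts
    |  measures
    |    A.ts as start_ts,
    |    last(B.ts) as end_ts,
    |    A.temp as start_temp,
    |    last(B.temp) as end_temp,
    |    avg(B.temp) as avg_temp,
    |    max(B.temp) as max_temp,   -- hottest reading while the alarm was active
    |    C.ts as recover_ts         -- first reading back below 50
    |  one row per match
    |  after match skip to next row
    |  pattern (A B+ C) within interval '90' second
    |  define
    |    A as A.temp < 50,
    |    B as B.temp >= 50,
    |    C as C.temp < 50
    |)
    |""".stripMargin
// The printed schema becomes: rack_id, start_ts, end_ts, start_temp, end_temp, avg_temp, max_temp, recover_ts.
tEnv.executeSql(extendedCepSql).print()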