Overview
ProcessFunction API / Stateful Stream Processing API
This is the lowest-level API.
There are eight ProcessFunction variants in total:
- ProcessFunction: used on a plain DataStream
- KeyedProcessFunction: used on a KeyedStream, i.e. after keyBy
- CoProcessFunction: used on two streams combined with connect
- ProcessJoinFunction: used on joined streams (e.g. interval joins)
- BroadcastProcessFunction: used on a stream connected with a broadcast stream (sketched right after this list)
- KeyedBroadcastProcessFunction: the broadcast variant for a KeyedStream
- ProcessWindowFunction: used on keyed windows (WindowedStream)
- ProcessAllWindowFunction: used on non-keyed windows (AllWindowedStream)
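The broadcast functions are the only ones not shown later in this post, so here is a minimal, hedged sketch of BroadcastProcessFunction. The stream sources, the port numbers, the single "current" rule key, and the prefixing logic are all assumptions made up for illustration, not code from any referenced project.
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataStream = env.socketTextStream("localhost", 9998); // main data (assumed source)
DataStream<String> ruleStream = env.socketTextStream("localhost", 9999); // rule updates (assumed source)

// descriptor for the broadcast state that holds the current rule
MapStateDescriptor<String, String> ruleDescriptor = new MapStateDescriptor<>(
        "rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
BroadcastStream<String> broadcastRules = ruleStream.broadcast(ruleDescriptor);

dataStream
        .connect(broadcastRules)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // the data side only has read-only access to the broadcast state
                String rule = ctx.getBroadcastState(ruleDescriptor).get("current");
                out.collect(rule == null ? value : rule + ":" + value);
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                // the broadcast side updates the state seen by all parallel instances
                ctx.getBroadcastState(ruleDescriptor).put("current", rule);
            }
        })
        .print();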
All of these process functions extend AbstractRichFunction, which implements the RichFunction interface shown below.
package org.apache.flink.api.common.functions;
import org.apache.flink.annotation.Public;
import org.apache.flink.configuration.Configuration;
/**
 * A base interface for all rich user-defined functions. This class defines methods for
* the life cycle of the functions, as well as methods to access the context in which the functions
* are executed.
*/
@Public
public interface RichFunction extends Function {
/**
* Initialization method for the function. It is called before the actual working methods
* (like <i>map</i> or <i>join</i>) and thus suitable for one time setup work. For functions that
* are part of an iteration, this method will be invoked at the beginning of each iteration superstep.
*
* <p>The configuration object passed to the function can be used for configuration and initialization.
* The configuration contains all parameters that were configured on the function in the program
* composition.
*
* <pre>{@code
 * public class MyMapper extends RichFilterFunction<String> {
*
* private String searchString;
*
* public void open(Configuration parameters) {
* this.searchString = parameters.getString("foo");
* }
*
* public boolean filter(String value) {
* return value.equals(searchString);
* }
* }
* }</pre>
*
* <p>By default, this method does nothing.
*
* @param parameters The configuration containing the parameters attached to the contract.
*
* @throws Exception Implementations may forward exceptions, which are caught by the runtime. When the
* runtime catches an exception, it aborts the task and lets the fail-over logic
* decide whether to retry the task execution.
*
* @see org.apache.flink.configuration.Configuration
*/
void open(Configuration parameters) throws Exception;
/**
* Tear-down method for the user code. It is called after the last call to the main working methods
* (e.g. <i>map</i> or <i>join</i>). For functions that are part of an iteration, this method will
* be invoked after each iteration superstep.
*
* <p>This method can be used for clean up work.
*
* @throws Exception Implementations may forward exceptions, which are caught by the runtime. When the
* runtime catches an exception, it aborts the task and lets the fail-over logic
* decide whether to retry the task execution.
*/
void close() throws Exception;
// ------------------------------------------------------------------------
// Runtime context
// ------------------------------------------------------------------------
/**
* Gets the context that contains information about the UDF's runtime, such as the
* parallelism of the function, the subtask index of the function, or the name of
 * the task that executes the function.
*
* <p>The RuntimeContext also gives access to the
* {@link org.apache.flink.api.common.accumulators.Accumulator}s and the
* {@link org.apache.flink.api.common.cache.DistributedCache}.
*
* @return The UDF's runtime context.
*/
RuntimeContext getRuntimeContext();
/**
* Gets a specialized version of the {@link RuntimeContext}, which has additional information
* about the iteration in which the function is executed. This IterationRuntimeContext is only
* available if the function is part of an iteration. Otherwise, this method throws an exception.
*
* @return The IterationRuntimeContext.
* @throws java.lang.IllegalStateException Thrown, if the function is not executed as part of an iteration.
*/
IterationRuntimeContext getIterationRuntimeContext();
/**
* Sets the function's runtime context. Called by the framework when creating a parallel instance of the function.
*
* @param t The runtime context.
*/
void setRuntimeContext(RuntimeContext t);
}
ProcessFunction itself is an abstract class. It adds two methods on top of AbstractRichFunction; the Collector parameter out is used to emit output records.
public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;
public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}
onTimer is called when a previously registered timer fires.
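As a minimal sketch (the class name and the 10-second delay are arbitrary choices for illustration), a KeyedProcessFunction can register a processing-time timer in processElement and react in onTimer when it fires:
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeoutFunction extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        // pass the element through and register a timer 10 seconds from now
        out.collect(value);
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 10_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // called by the runtime when the registered timer fires
        out.collect("timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}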
CoProcessFunction, for example, supports processing two connected streams at the same time:
public abstract void processElement1(IN1 value, Context ctx, Collector<OUT> out) throws Exception;
public abstract void processElement2(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
ProcessJoinFunction, in turn, processes the pairs produced by a join:
public abstract void processElement(IN1 left, IN2 right, Context ctx, Collector<OUT> out) throws Exception;
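A common place to see ProcessJoinFunction is an event-time interval join between two keyed streams. In the sketch below, the orders and payments streams, the ±5 second bounds, and the output format are assumptions for illustration only:
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// assume orders and payments are KeyedStream<Tuple2<String, Integer>, String> keyed by the same id
DataStream<String> joined = orders
        .intervalJoin(payments)
        .between(Time.seconds(-5), Time.seconds(5))
        .process(new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
            @Override
            public void processElement(Tuple2<String, Integer> left,
                                       Tuple2<String, Integer> right,
                                       Context ctx,
                                       Collector<String> out) {
                // invoked once for every pair of elements that joins within the time bounds
                out.collect(left.f0 + " -> " + left.f1 + "/" + right.f1);
            }
        });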
For concrete end-to-end usage, the following blog posts are a good reference:
https://xinchen.blog.csdn.net/article/details/109624375
https://blog.csdn.net/boling_cavalry/article/details/109645214?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_title-2&spm=1001.2101.3001.4242
The input streams can be produced as follows:
protected KeyedStream<Tuple2<String, Integer>, Tuple> buildStreamFromSocket(StreamExecutionEnvironment env, int port) {
    return env
            // listen on the given port
            .socketTextStream("localhost", port)
            // turn a line such as "aaa,3" into a Tuple2 with f0="aaa", f1=3
            .map(new WordCountMap())
            // partition the stream by the word, i.e. use it as the key
            .keyBy(0);
}
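WordCountMap comes from the referenced blog and is not reproduced there in full; a plausible implementation (an assumption, not the blog author's exact code) simply splits the comma-separated line:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// turns a line such as "aaa,3" into Tuple2("aaa", 3)
public class WordCountMap implements MapFunction<String, Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> map(String line) throws Exception {
        String[] fields = line.split(",");
        return new Tuple2<>(fields[0].trim(), Integer.valueOf(fields[1].trim()));
    }
}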
The connected streams can then be processed like this:
SingleOutputStreamOperator<Tuple2<String, Integer>> mainDataStream = stream1
        // connect the two streams
        .connect(stream2)
        // apply the low-level process function; the actual logic lives in the anonymous subclass
        .process(new CoProcessFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public void processElement1(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) {
                logger.info("processing element from stream 1: {}", value);
                out.collect(value);
            }

            @Override
            public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) {
                logger.info("processing element from stream 2: {}", value);
                out.collect(value);
            }
        });
The following shows how to write to both the main output and a side output:
DataStream<Integer> input = ...;
final OutputTag<String> outputTag = new OutputTag<String>("side-output"){};
SingleOutputStreamOperator<Integer> mainDataStream = input
.process(new ProcessFunction<Integer, Integer>() {
@Override
public void processElement(
Integer value,
Context ctx,
Collector<Integer> out) throws Exception {
// emit the value to the regular (main) output
out.collect(value);
// emit a record to the side output
ctx.output(outputTag, "sideout-" + String.valueOf(value));
}
});
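The records sent to the side output can afterwards be read back from the operator via getSideOutput, for example:
// retrieve the side-output stream registered under outputTag and print it
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
sideOutputStream.print();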
A simple end-to-end job setup looks like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// parallelism of 1
env.setParallelism(1);
// listen for input on port 9998
KeyedStream<Tuple2<String, Integer>, Tuple> stream1 = buildStreamFromSocket(env, 9998);
// listen for input on port 9999
KeyedStream<Tuple2<String, Integer>, Tuple> stream2 = buildStreamFromSocket(env, 9999);
SingleOutputStreamOperator<Tuple2<String, Integer>> mainDataStream = stream1
        // connect the two streams
        .connect(stream2)
        // apply the low-level process function; the concrete logic is implemented in a subclass
        .process(getCoProcessFunctionInstance());
// print every element emitted by the process function
mainDataStream.print();
// side-output handling; subclasses override this method when they need a side output
doSideOutput(mainDataStream);
// run the job
env.execute("ProcessFunction demo : CoProcessFunction");
The KeyedStream used here is a subclass of DataStream.
DataSet/DataStream API
The DataStream API works on (possibly unbounded) streams; the DataSet API is its batch-oriented counterpart.
A simple example follows:
//Set up the execution environment.
//The execution environment is used to define job properties, create sources,
//and eventually trigger job execution.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Create the source.
//A source ingests data from an external system, such as Apache Kafka or RabbitMQ, into the Flink program.
//TransactionSource here is a source that produces transaction records; sources are discussed separately.
DataStream<Transaction> transactions = env
    .addSource(new TransactionSource())
    .name("transactions");
//Partition the events and run fraud detection.
//Since fraud happens on a per-account basis,
//all transactions of the same account must be processed by the same parallel task.
//This is similar to the map step in map-reduce.
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .process(new FraudDetector())
    .name("fraud-detector");
//Sink the data.
//A sink writes a DataStream out to an external system, such as Apache Kafka or Cassandra.
//AlertSink logs every Alert record at INFO level.
alerts
    .addSink(new AlertSink())
    .name("send-alerts");
//Run the job.
//Flink programs are built lazily and are only submitted for execution once the pipeline is fully assembled.
//Passing a job name to StreamExecutionEnvironment#execute starts the job.
env.execute("Fraud Detection");
Here, FraudDetector is also a process function: it extends KeyedProcessFunction, since it runs on the keyed stream.
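The walkthrough's FraudDetector keeps per-account state to detect a small transaction followed by a large one; the stripped-down sketch below only alerts on large transactions and is a simplification for illustration, not the walkthrough's full logic:
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;

public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> {

    private static final double LARGE_AMOUNT = 500.00;

    @Override
    public void processElement(Transaction transaction, Context context, Collector<Alert> collector) throws Exception {
        // simplified rule: raise an alert for every transaction above the threshold
        if (transaction.getAmount() > LARGE_AMOUNT) {
            Alert alert = new Alert();
            alert.setId(transaction.getAccountId());
            collector.collect(alert);
        }
    }
}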
TABLE API
The Table API is implemented under the flink-table module.
A simple demo:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;
public class JavaStreamWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
        // register a CSV file of words as a table source named "fileSource"
        String path = JavaStreamWordCount.class.getClassLoader().getResource("words.txt").getPath();
        tEnv.connect(new FileSystem().path(path))
            .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
            .withSchema(new Schema().field("word", Types.STRING))
            .inAppendMode()
            .registerTableSource("fileSource");
        // count the occurrences of each word with the Table API
        Table result = tEnv.scan("fileSource")
            .groupBy("word")
            .select("word, count(1) as count");
        // emit the continuously updated counts as a retract stream and print them
        tEnv.toRetractStream(result, Row.class).print();
        env.execute();
    }
}
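Note that toRetractStream turns the continuously updating result table into a DataStream of Tuple2<Boolean, Row>: the Boolean flag marks whether a row is being added (true) or retracted (false), which is why the printed counts show up as add/retract pairs as a word's count changes.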
SQL API
The SQL API is likewise implemented under the flink-table module.
First, prepare the environment:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
Register the source DataSet as a table:
Table topScore = tableEnv.fromDataSet(topInput);
tableEnv.registerTable("score", topScore);
Then submit the SQL query:
Table queryResult = tableEnv.sqlQuery(
        "SELECT player, " +
        "       count(season) as num " +
        "FROM score " +
        "GROUP BY player " +
        "ORDER BY num desc " +
        "LIMIT 3");