Overview
ProcessFunction API / Stateful Stream Processing API
This is the lowest-level API.
There are eight ProcessFunction variants in total:
- ProcessFunction: used on a plain DataStream
- KeyedProcessFunction: used on a KeyedStream, i.e. after keyBy
- CoProcessFunction: used on two streams combined with connect
- ProcessJoinFunction: used on joined streams (e.g. interval joins)
- BroadcastProcessFunction: used on a stream connected with a broadcast stream (sketched right after this list)
- KeyedBroadcastProcessFunction: the broadcast variant for a KeyedStream
- ProcessWindowFunction: used on keyed windows (WindowedStream)
- ProcessAllWindowFunction: used on non-keyed windows (AllWindowedStream)
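The broadcast functions are the only ones not shown later in this post, so here is a minimal, hedged sketch of BroadcastProcessFunction. The stream sources, the port numbers, the single "current" rule key, and the prefixing logic are all assumptions made up for illustration, not code from any referenced project.
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> dataStream = env.socketTextStream("localhost", 9998); // main data (assumed source)
DataStream<String> ruleStream = env.socketTextStream("localhost", 9999); // rule updates (assumed source)

// descriptor for the broadcast state that holds the current rule
MapStateDescriptor<String, String> ruleDescriptor = new MapStateDescriptor<>(
        "rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
BroadcastStream<String> broadcastRules = ruleStream.broadcast(ruleDescriptor);

dataStream
        .connect(broadcastRules)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                // the data side only has read-only access to the broadcast state
                String rule = ctx.getBroadcastState(ruleDescriptor).get("current");
                out.collect(rule == null ? value : rule + ":" + value);
            }

            @Override
            public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                // the broadcast side updates the state seen by all parallel instances
                ctx.getBroadcastState(ruleDescriptor).put("current", rule);
            }
        })
        .print();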
All of these process functions extend AbstractRichFunction, which implements the RichFunction interface shown below.
package org.apache.flink.api.common.functions;
import org.apache.flink.annotation.Public;
import org.apache.flink.configuration.Configuration;
/**
 * A base interface for all rich user-defined functions. This class defines methods for
* the life cycle of the functions, as well as methods to access the context in which the functions
* are executed.
*/
@Public
public interface RichFunction extends Function {
/**
* Initialization method for the function. It is called before the actual working methods
* (like <i>map</i> or <i>join</i>) and thus suitable for one time setup work. For functions that
* are part of an iteration, this method will be invoked at the beginning of each iteration superstep.
*
* <p>The configuration object passed to the function can be used for configuration and initialization.
* The configuration contains all parameters that were configured on the function in the program
* composition.
*
* <pre>{@code
 * public class MyMapper extends RichFilterFunction<String> {
*
* private String searchString;
*
* public void open(Configuration parameters) {
* this.searchString = parameters.getString("foo");
* }
*
* public boolean filter(String value) {
* return value.equals(searchString);
* }
* }
* }</pre>
*
* <p>By default, this method does nothing.
*
* @param parameters The configuration containing the parameters attached to the contract.
*
* @throws Exception Implementations may forward exceptions, which are caught by the runtime. When the
* runtime catches an exception, it aborts the task and lets the fail-over logic
* decide whether to retry the task execution.
*
* @see org.apache.flink.configuration.Configuration
*/
void open(Configuration parameters) throws Exception;
/**
* Tear-down method for the user code. It is called after the last call to the main working methods
* (e.g. <i>map</i> or <i>join</i>). For functions that are part of an iteration, this method will
* be invoked after each iteration superstep.
*
* <p>This method can be used for clean up work.
*
* @throws Exception Implementations may forward exceptions, which are caught by the runtime. When the
* runtime catches an exception, it aborts the task and lets the fail-over logic
* decide whether to retry the task execution.
*/
void close() throws Exception;
// ------------------------------------------------------------------------
// Runtime context
// ------------------------------------------------------------------------
/**
* Gets the context that contains information about the UDF's runtime, such as the
* parallelism of the function, the subtask index of the function, or the name of
 * the task that executes the function.
*
* <p>The RuntimeContext also gives access to the
* {@link org.apache.flink.api.common.accumulators.Accumulator}s and the
* {@link org.apache.flink.api.common.cache.DistributedCache}.
*
* @return The UDF's runtime context.
*/
RuntimeContext getRuntimeContext();
/**
* Gets a specialized version of the {@link RuntimeContext}, which has additional information
* about the iteration in which the function is executed. This IterationRuntimeContext is only
* available if the function is part of an iteration. Otherwise, this method throws an exception.
*
* @return The IterationRuntimeContext.
* @throws java.lang.IllegalStateException Thrown, if the function is not executed as part of an iteration.
*/
IterationRuntimeContext getIterationRuntimeContext();
/**
* Sets the function's runtime context. Called by the framework when creating a parallel instance of the function.
*
* @param t The runtime context.
*/
void setRuntimeContext(RuntimeContext t);
}
ProcessFunction itself is an abstract class. It adds two methods on top of AbstractRichFunction; the Collector parameter out is used to emit output records.
public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;
public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}
onTimer is called when a previously registered timer fires.
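As a minimal sketch (the class name and the 10-second delay are arbitrary choices for illustration), a KeyedProcessFunction can register a processing-time timer in processElement and react in onTimer when it fires:
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeoutFunction extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        // pass the element through and register a timer 10 seconds from now
        out.collect(value);
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 10_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // called by the runtime when the registered timer fires
        out.collect("timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}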
CoProcessFunction, for example, supports processing two connected streams at the same time:
public abstract void processElement1(IN1 value, Context ctx, Collector<OUT> out) throws Exception;
public abstract void processElement2(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
ProcessJoinFunction, in turn, processes the pairs produced by a join:
public abstract void processElement(IN1 left, IN2 right, Context ctx, Collector<OUT> out) throws Exception;
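A common place to see ProcessJoinFunction is an event-time interval join between two keyed streams. In the sketch below, the orders and payments streams, the ±5 second bounds, and the output format are assumptions for illustration only:
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// assume orders and payments are KeyedStream<Tuple2<String, Integer>, String> keyed by the same id
DataStream<String> joined = orders
        .intervalJoin(payments)
        .between(Time.seconds(-5), Time.seconds(5))
        .process(new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
            @Override
            public void processElement(Tuple2<String, Integer> left,
                                       Tuple2<String, Integer> right,
                                       Context ctx,
                                       Collector<String> out) {
                // invoked once for every pair of elements that joins within the time bounds
                out.collect(left.f0 + " -> " + left.f1 + "/" + right.f1);
            }
        });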
For concrete end-to-end usage, the following blog posts are a good reference:
https://xinchen.blog.csdn.net/article/details/109624375
https://blog.csdn.net/boling_cavalry/article/details/109645214?utm_medium=distribute.pc_relevant.none-task-blog-baidujs_title-2&spm=1001.2101.3001.4242
The input streams can be produced as follows:
protected KeyedStream<Tuple2<String, Integer>, Tuple> buildStreamFromSocket(StreamExecutionEnvironment env, int port) {
    return env
            // listen on the given port
            .socketTextStream("localhost", port)
            // turn a line such as "aaa,3" into a Tuple2 with f0="aaa", f1=3
            .map(new WordCountMap())
            // partition the stream by the word, i.e. use it as the key
            .keyBy(0);
}
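WordCountMap comes from the referenced blog and is not reproduced there in full; a plausible implementation (an assumption, not the blog author's exact code) simply splits the comma-separated line:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// turns a line such as "aaa,3" into Tuple2("aaa", 3)
public class WordCountMap implements MapFunction<String, Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> map(String line) throws Exception {
        String[] fields = line.split(",");
        return new Tuple2<>(fields[0].trim(), Integer.valueOf(fields[1].trim()));
    }
}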
The connected streams can then be processed like this:
SingleOutputStreamOperator<Tuple2<String, Integer>> mainDataStream = stream1
        // connect the two streams
        .connect(stream2)
        // apply the low-level process function; the actual logic lives in the anonymous subclass
        .process(new CoProcessFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public void processElement1(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) {
                logger.info("processing element from stream 1: {}", value);
                out.collect(value);
            }

            @Override
            public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) {
                logger.info("processing element from stream 2: {}", value);
                out.collect(value);
            }
        });
The following shows how to write to both the main output and a side output:
DataStream<Integer> input = ...;
final OutputTag<String> outputTag = new OutputTag<String>("side-output"){};
SingleOutputStreamOperator<Integer> mainDataStream = input
.process(new ProcessFunction<Integer, Integer>() {
@Override
public void processElement(
Integer value,
Context ctx,
Collector<Integer> out) throws Exception {
// emit the value to the regular (main) output
out.collect(value);
// emit a record to the side output
ctx.output(outputTag, "sideout-" + String.valueOf(value));
}
});
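The records sent to the side output can afterwards be read back from the operator via getSideOutput, for example:
// retrieve the side-output stream registered under outputTag and print it
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
sideOutputStream.print();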
A simple end-to-end job setup looks like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// parallelism of 1
env.setParallelism(1);
// listen for input on port 9998
KeyedStream<Tuple2<String, Integer>, Tuple> stream1 = buildStreamFromSocket(env, 9998);
// listen for input on port 9999
KeyedStream<Tuple2<String, Integer>, Tuple> stream2 = buildStreamFromSocket(env, 9999);
SingleOutputStreamOperator<Tuple2<String, Integer>> mainDataStream = stream1
        // connect the two streams
        .connect(stream2)
        // apply the low-level process function; the concrete logic is implemented in a subclass
        .process(getCoProcessFunctionInstance());
// print every element emitted by the process function
mainDataStream.print();
// side-output handling; subclasses override this method when they need a side output
doSideOutput(mainDataStream);
// run the job
env.execute("ProcessFunction demo : CoProcessFunction");
The KeyedStream used here is a subclass of DataStream.
DataSet/DataStream API
The DataStream API works on (possibly unbounded) streams; the DataSet API is its batch-oriented counterpart.
A simple example follows:
//Set up the execution environment.
//The execution environment is used to define job properties, create sources,
//and eventually trigger job execution.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Create the source.
//A source ingests data from an external system, such as Apache Kafka or RabbitMQ, into the Flink program.
//TransactionSource here is a source that produces transaction records; sources are discussed separately.
DataStream<Transaction> transactions = env
    .addSource(new TransactionSource())
    .name("transactions");
//Partition the events and run fraud detection.
//Since fraud happens on a per-account basis,
//all transactions of the same account must be processed by the same parallel task.
//This is similar to the map step in map-reduce.
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .process(new FraudDetector())
    .name("fraud-detector");
//Sink the data.
//A sink writes a DataStream out to an external system, such as Apache Kafka or Cassandra.
//AlertSink logs every Alert record at INFO level.
alerts
    .addSink(new AlertSink())
    .name("send-alerts");
//Run the job.
//Flink programs are built lazily and are only submitted for execution once the pipeline is fully assembled.
//Passing a job name to StreamExecutionEnvironment#execute starts the job.
env.execute("Fraud Detection");
Here, FraudDetector is also a process function: it extends KeyedProcessFunction, since it runs on the keyed stream.
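The walkthrough's FraudDetector keeps per-account state to detect a small transaction followed by a large one; the stripped-down sketch below only alerts on large transactions and is a simplification for illustration, not the walkthrough's full logic:
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;

public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> {

    private static final double LARGE_AMOUNT = 500.00;

    @Override
    public void processElement(Transaction transaction, Context context, Collector<Alert> collector) throws Exception {
        // simplified rule: raise an alert for every transaction above the threshold
        if (transaction.getAmount() > LARGE_AMOUNT) {
            Alert alert = new Alert();
            alert.setId(transaction.getAccountId());
            collector.collect(alert);
        }
    }
}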
TABLE API
The Table API is implemented under the flink-table module.
A simple demo:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.OldCsv;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;
public class JavaStreamWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
        // register a CSV file of words as a table source named "fileSource"
        String path = JavaStreamWordCount.class.getClassLoader().getResource("words.txt").getPath();
        tEnv.connect(new FileSystem().path(path))
            .withFormat(new OldCsv().field("word", Types.STRING).lineDelimiter("\n"))
            .withSchema(new Schema().field("word", Types.STRING))
            .inAppendMode()
            .registerTableSource("fileSource");
        // count the occurrences of each word with the Table API
        Table result = tEnv.scan("fileSource")
            .groupBy("word")
            .select("word, count(1) as count");
        // emit the continuously updated counts as a retract stream and print them
        tEnv.toRetractStream(result, Row.class).print();
        env.execute();
    }
}
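Note that toRetractStream turns the continuously updating result table into a DataStream of Tuple2<Boolean, Row>: the Boolean flag marks whether a row is being added (true) or retracted (false), which is why the printed counts show up as add/retract pairs as a word's count changes.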
SQL API
The SQL API is likewise implemented under the flink-table module.
First, prepare the environment:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
Register the source DataSet as a table:
Table topScore = tableEnv.fromDataSet(topInput);
tableEnv.registerTable("score", topScore);
Then submit the SQL query:
Table queryResult = tableEnv.sqlQuery(
        "SELECT player, " +
        "       count(season) as num " +
        "FROM score " +
        "GROUP BY player " +
        "ORDER BY num desc " +
        "LIMIT 3");