Table of Contents
1. Job Startup Entry Point
2. The Task Execution Command: SparkTaskExecuteCommand
3. Creating and Initializing SparkExecution
4. Task Execution: SparkExecution.execute()
5. Source Processing Flow
6. Transform Processing Flow
7. Sink Processing Flow
Execution Flow Overview
Key Design Takeaways
This article analyzes the Spark engine of SeaTunnel 2.3.x at the source-code level, using seatunnel-examples/seatunnel-spark-connector-v2-example/src/main/java/org/apache/seatunnel/example/spark/v2/SeaTunnelApiExample.java as the entry point to walk through the complete execution flow.
1. Job Startup Entry Point
Core code of the startup class:

```java
public static void main(String[] args) {
    // 1. Create the Spark command-line argument object
    SparkCommandArgs sparkCommandArgs = new SparkCommandArgs();
    // 2. Hand the built command to SeaTunnel.run(), which invokes the Spark execute command
    SeaTunnel.run(sparkCommandArgs.buildCommand());
}
```
- buildCommand() returns a SparkTaskExecuteCommand instance.
- SeaTunnel.run() ultimately calls SparkTaskExecuteCommand.execute(), as sketched below.
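For reference, SeaTunnel.run() itself is only a thin dispatcher around the Command contract from seatunnel-core-starter. The following is a minimal sketch of what it boils down to, not the exact starter source (the real method also adds logging and wraps failures in a starter-specific exception type):

```java
// Minimal sketch, assuming Command<T extends CommandArgs> with a single execute() method.
public static <T extends CommandArgs> void run(Command<T> command) {
    try {
        // for the Spark starter, this is SparkTaskExecuteCommand.execute()
        command.execute();
    } catch (Exception e) {
        throw new RuntimeException("SeaTunnel job execution failed", e);
    }
}
```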
2. The Task Execution Command: SparkTaskExecuteCommand
Core execution flow:

```java
public void execute() {
    // 1. Parse the configuration file into a Config object
    Config config = ConfigBuilder.of(configFile);
    // 2. Create the SparkExecution instance
    SparkExecution seaTunnelTaskExecution = new SparkExecution(config);
    // 3. Run the job
    seaTunnelTaskExecution.execute();
}
```
3. Creating and Initializing SparkExecution
3.1 Core Component Initialization

```java
public SparkExecution(Config config) {
    // Create the Spark runtime environment
    this.sparkRuntimeEnvironment = SparkRuntimeEnvironment.getInstance(config);
    JobContext jobContext = new JobContext();
    jobContext.setJobMode(RuntimeEnvironment.getJobMode(config));
    // Create the three processors
    this.sourcePluginExecuteProcessor =
            new SourceExecuteProcessor(
                    sparkRuntimeEnvironment, jobContext, config.getConfigList(Constants.SOURCE));
    this.transformPluginExecuteProcessor =
            new TransformExecuteProcessor(
                    sparkRuntimeEnvironment,
                    jobContext,
                    TypesafeConfigUtils.getConfigList(
                            config, Constants.TRANSFORM, Collections.emptyList()));
    this.sinkPluginExecuteProcessor =
            new SinkExecuteProcessor(
                    sparkRuntimeEnvironment, jobContext, config.getConfigList(Constants.SINK));
}
```
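SparkRuntimeEnvironment.getInstance(config) is where the SparkSession is created. The sketch below only illustrates, under the assumption that the environment maps the job's env block onto a SparkConf, roughly what it encapsulates; the actual method and option names inside SparkRuntimeEnvironment may differ:

```java
// Illustrative only (not the exact SparkRuntimeEnvironment code): derive a SparkConf
// from the job configuration and create the SparkSession shared by all processors.
private SparkSession createSparkSession(Config config) {
    SparkConf sparkConf = new SparkConf();
    if (config.hasPath("env")) { // "env" is the top-level block of a SeaTunnel job config
        config.getConfig("env")
                .entrySet()
                .forEach(e -> sparkConf.set(e.getKey(), e.getValue().unwrapped().toString()));
    }
    return SparkSession.builder().config(sparkConf).getOrCreate();
}
```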
3.2 Key Objects

| Component | Type | Responsibility |
| --- | --- | --- |
| sourcePluginExecuteProcessor | SourceExecuteProcessor | Handles data source ingestion |
| transformPluginExecuteProcessor | TransformExecuteProcessor | Handles data transformation logic |
| sinkPluginExecuteProcessor | SinkExecuteProcessor | Handles data output |
| sparkRuntimeEnvironment | SparkRuntimeEnvironment | Wraps the SparkSession and the runtime environment |
4. Task Execution: SparkExecution.execute()
DAG construction flow:

```java
public void execute() throws TaskExecuteException {
    // Initialize the list of datasets
    List<Dataset<Row>> datasets = new ArrayList<>();
    // Run the three processors in order
    datasets = sourcePluginExecuteProcessor.execute(datasets);
    datasets = transformPluginExecuteProcessor.execute(datasets);
    sinkPluginExecuteProcessor.execute(datasets);
    log.info("Spark Execution started");
}
```

Since Spark transformations are lazy, this chain only assembles the DAG; the actual Spark job is triggered when the sink stage calls save().
5. Source Processing Flow
5.1 Plugin Initialization
Call chain:

```
SourceExecuteProcessor()
  → super(sparkRuntimeEnvironment, jobContext, sourceConfigs)   // invoke the parent constructor
  → this.plugins = initializePlugins(pluginConfigs)
```
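All three processors share a common abstract base class that holds the runtime environment, the job context, and the plugin lists. The sketch below is a hedged reconstruction from the calls used in this article; field and method names follow those calls, while the actual class name and generics in seatunnel-core-starter may differ slightly:

```java
// Hedged reconstruction of the shared processor base class; only the members
// referenced in this walkthrough are shown.
public abstract class AbstractPluginExecuteProcessor<T> {
    protected final SparkRuntimeEnvironment sparkRuntimeEnvironment;
    protected final JobContext jobContext;
    protected final List<? extends Config> pluginConfigs;
    protected final List<T> plugins;

    protected AbstractPluginExecuteProcessor(
            SparkRuntimeEnvironment sparkRuntimeEnvironment,
            JobContext jobContext,
            List<? extends Config> pluginConfigs) {
        this.sparkRuntimeEnvironment = sparkRuntimeEnvironment;
        this.jobContext = jobContext;
        this.pluginConfigs = pluginConfigs;
        // each concrete processor (Source/Transform/Sink) supplies its own plugin loading
        this.plugins = initializePlugins(pluginConfigs);
    }

    protected abstract List<T> initializePlugins(List<? extends Config> pluginConfigs);

    public abstract List<Dataset<Row>> execute(List<Dataset<Row>> upstreamDataStreams);
}
```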
Core plugin-loading logic:

```java
protected List<SeaTunnelSource<?, ?, ?>> initializePlugins(List<? extends Config> pluginConfigs) {
    SeaTunnelSourcePluginDiscovery discovery = new SeaTunnelSourcePluginDiscovery();
    List<SeaTunnelSource<?, ?, ?>> sources = new ArrayList<>();
    Set<URL> jars = new HashSet<>();
    for (Config sourceConfig : pluginConfigs) {
        // 1. Identify the plugin
        PluginIdentifier identifier =
                PluginIdentifier.of(ENGINE_TYPE, PLUGIN_TYPE, sourceConfig.getString(PLUGIN_NAME));
        // 2. Collect the plugin's JAR dependencies
        jars.addAll(discovery.getPluginJarPaths(Lists.newArrayList(identifier)));
        // 3. Create the plugin instance
        SeaTunnelSource<?, ?, ?> source = discovery.createPluginInstance(identifier);
        // 4. Initialize the plugin
        source.prepare(sourceConfig);
        source.setJobContext(jobContext);
        sources.add(source);
    }
    // 5. Register the plugin JARs with the Spark environment
    sparkRuntimeEnvironment.registerPlugin(new ArrayList<>(jars));
    return sources;
}
```
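SeaTunnelSourcePluginDiscovery is built on Java's SPI mechanism. The snippet below is a simplified illustration of that underlying principle only; the real discovery code additionally handles per-plugin classloaders and the connector JAR mapping:

```java
// Simplified ServiceLoader lookup illustrating the SPI idea behind createPluginInstance();
// not the actual discovery implementation.
@SuppressWarnings({"rawtypes", "unchecked"})
static SeaTunnelSource<?, ?, ?> loadSourceBySpi(PluginIdentifier identifier) {
    ServiceLoader<SeaTunnelSource> loader =
            ServiceLoader.load(SeaTunnelSource.class, Thread.currentThread().getContextClassLoader());
    for (SeaTunnelSource candidate : loader) {
        // every plugin reports its name through the PluginIdentifierInterface contract
        if (candidate.getPluginName().equals(identifier.getPluginName())) {
            return candidate;
        }
    }
    throw new IllegalArgumentException("No source plugin found for " + identifier.getPluginName());
}
```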
5.2 Generating the Data Stream
Execution entry point:

```java
public List<Dataset<Row>> execute(List<Dataset<Row>> upstreamDataStreams) {
    List<Dataset<Row>> sources = new ArrayList<>();
    for (int i = 0; i < plugins.size(); i++) {
        SeaTunnelSource<?, ?, ?> source = plugins.get(i);
        Config pluginConfig = pluginConfigs.get(i);
        // 1. Determine the parallelism
        int parallelism =
                pluginConfig.hasPath(CommonOptions.PARALLELISM.key())
                        ? pluginConfig.getInt(CommonOptions.PARALLELISM.key())
                        : sparkRuntimeEnvironment
                                .getSparkConf()
                                .getInt(
                                        CommonOptions.PARALLELISM.key(),
                                        CommonOptions.PARALLELISM.defaultValue());
        // 2. Create the Dataset (the core step)
        Dataset<Row> dataset =
                sparkRuntimeEnvironment
                        .getSparkSession()
                        .read()
                        .format(SeaTunnelSource.class.getSimpleName()) // "SeaTunnelSource" as the format identifier
                        .option(CommonOptions.PARALLELISM.key(), parallelism) // set the parallelism
                        .option(
                                Constants.SOURCE_SERIALIZATION,
                                SerializationUtils.objectToString(source)) // serialize the plugin instance
                        .schema((StructType) TypeConverterUtils.convert(source.getProducedType())) // set the schema
                        .load();
        // 3. Register the temp view
        registerInputTempView(pluginConfig, dataset);
        sources.add(dataset);
    }
    return sources;
}
```
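The call .format(SeaTunnelSource.class.getSimpleName()) only resolves because the Spark translation layer exposes a data source under the short name "SeaTunnelSource". One common way to do this in Spark is the DataSourceRegister SPI; the provider class below is hypothetical and only illustrates the registration, whereas the real provider in seatunnel-translation-spark also implements the read path that deserializes the source from the SOURCE_SERIALIZATION option:

```java
// Hypothetical provider: registering the short name that .format("SeaTunnelSource") resolves to.
// Spark discovers it via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
public class SeaTunnelSourceProvider implements DataSourceRegister {
    @Override
    public String shortName() {
        return SeaTunnelSource.class.getSimpleName(); // "SeaTunnelSource"
    }
}
```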
Temp view registration logic:

```java
void registerInputTempView(Config config, Dataset<Row> dataset) {
    if (config.hasPath(RESULT_TABLE_NAME)) {
        String tableName = config.getString(RESULT_TABLE_NAME);
        // Create a Spark temporary view
        dataset.createOrReplaceTempView(tableName);
    }
}
```
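Once registered, the result table is an ordinary Spark temp view, so any downstream stage, or plain Spark SQL, can resolve it by name. A small usage sketch, assuming a source config that declared result_table_name = "fake_table" (a hypothetical name):

```java
// The registered temp view is resolvable by name, which is exactly what
// fromSourceTable() relies on in the transform and sink stages.
Dataset<Row> byName = sparkRuntimeEnvironment.getSparkSession().read().table("fake_table");
Dataset<Row> bySql = sparkRuntimeEnvironment.getSparkSession().sql("SELECT * FROM fake_table");
```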
6. Transform Processing Flow
6.1 Plugin Initialization
Key validation logic (inside the concrete Transform implementation):

```java
public void prepare(Config pluginConfig) {
    // Both source_table_name and result_table_name are required
    if (!pluginConfig.hasPath(SOURCE_TABLE_NAME) || !pluginConfig.hasPath(RESULT_TABLE_NAME)) {
        throw new IllegalArgumentException("Missing required table name config");
    }
    // The input and output table names must differ
    if (Objects.equals(
            pluginConfig.getString(SOURCE_TABLE_NAME),
            pluginConfig.getString(RESULT_TABLE_NAME))) {
        throw new IllegalArgumentException("Source and result table names must be different");
    }
    // Delegate to the concrete Transform's configuration setup
    setConfig(pluginConfig);
}
```
6.2 Transformation Execution
Core processing flow:

```java
public List<Dataset<Row>> execute(List<Dataset<Row>> upstreamDataStreams) {
    Dataset<Row> input = upstreamDataStreams.get(0); // default to the first upstream stream
    List<Dataset<Row>> result = new ArrayList<>();
    for (int i = 0; i < plugins.size(); i++) {
        SeaTunnelTransform<SeaTunnelRow> transform = plugins.get(i);
        Config pluginConfig = pluginConfigs.get(i);
        // 1. Resolve the input stream (looked up by source_table_name)
        Dataset<Row> stream =
                fromSourceTable(pluginConfig, sparkRuntimeEnvironment).orElse(input);
        // 2. Apply the transformation
        input = sparkTransform(transform, stream);
        // 3. Register the result table
        registerInputTempView(pluginConfig, input);
        result.add(input);
    }
    return result;
}
```
Transformation operator implementation:

```java
private Dataset<Row> sparkTransform(SeaTunnelTransform transform, Dataset<Row> stream) {
    // 1. Convert the Spark schema into the SeaTunnel type system
    SeaTunnelDataType<?> inputType = TypeConverterUtils.convert(stream.schema());
    transform.setTypeInfo(inputType);
    // 2. Build the output schema
    StructType outputSchema = (StructType) TypeConverterUtils.convert(transform.getProducedType());
    // 3. Create the row converters
    SeaTunnelRowConverter inputConverter = new SeaTunnelRowConverter(inputType);
    SeaTunnelRowConverter outputConverter = new SeaTunnelRowConverter(transform.getProducedType());
    // 4. Apply the transformation via mapPartitions
    ExpressionEncoder<Row> encoder = RowEncoder.apply(outputSchema);
    return stream.mapPartitions(
                    (MapPartitionsFunction<Row, Row>)
                            inputIterator ->
                                    new TransformIterator( // custom iterator wrapping the transform logic
                                            inputIterator,
                                            transform,
                                            outputSchema,
                                            inputConverter,
                                            outputConverter),
                    encoder)
            .filter(row -> row != null); // drop rows the transform filtered out
}
```
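To see the mapPartitions-plus-explicit-encoder pattern in isolation, here is a small standalone Spark example, independent of SeaTunnel and purely illustrative:

```java
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MapPartitionsPatternExample {
    public static void main(String[] args) {
        SparkSession spark =
                SparkSession.builder().master("local[1]").appName("demo").getOrCreate();
        StructType schema = new StructType().add("name", DataTypes.StringType);
        ExpressionEncoder<Row> encoder = RowEncoder.apply(schema);

        Dataset<Row> input =
                spark.createDataFrame(Arrays.asList(RowFactory.create("seatunnel")), schema);

        // Same shape as sparkTransform(): an iterator-to-iterator mapping per partition,
        // with an explicit Row encoder describing the output schema.
        Dataset<Row> upper = input.mapPartitions(
                (MapPartitionsFunction<Row, Row>) it -> new Iterator<Row>() {
                    @Override public boolean hasNext() { return it.hasNext(); }
                    @Override public Row next() {
                        return RowFactory.create(it.next().getString(0).toUpperCase());
                    }
                },
                encoder);

        upper.show();
        spark.stop();
    }
}
```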
Key logic of the custom TransformIterator:

```java
public class TransformIterator implements Iterator<Row> {
    @Override
    public Row next() {
        Row row = input.next();
        // 1. Convert the input Row into a SeaTunnelRow
        SeaTunnelRow inRow = inputConverter.convert(row);
        // 2. Run the transform's core logic
        SeaTunnelRow outRow = (SeaTunnelRow) transform.map(inRow);
        if (outRow != null) {
            // 3. Convert the output SeaTunnelRow back into a Spark Row
            return outputConverter.convert(outRow);
        }
        // rows dropped by the transform yield null and are removed by the
        // filter(row -> row != null) call in sparkTransform()
        return null;
    }
}
```
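What transform.map(inRow) does depends on the concrete plugin; conceptually it is a row-to-row function over SeaTunnelRow. The class below is a hypothetical illustration of such a mapping only, not a complete SeaTunnelTransform plugin (lifecycle and configuration methods are omitted):

```java
// Hypothetical row mapping mirroring the transform.map contract used by TransformIterator.
public class UppercaseMapper {
    public SeaTunnelRow map(SeaTunnelRow row) {
        Object[] fields = row.getFields().clone();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] instanceof String) {
                fields[i] = ((String) fields[i]).toUpperCase();
            }
        }
        // returning null instead would drop the row (removed by the filter in sparkTransform)
        return new SeaTunnelRow(fields);
    }
}
```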
7. Sink Processing Flow
7.1 Plugin Initialization
Sink-specific handling:

```java
protected List<SeaTunnelSink<?, ?, ?, ?>> initializePlugins(...) {
    for (Config sinkConfig : pluginConfigs) {
        // ... create the sink instance
        // Data save mode handling
        if (sink instanceof SupportDataSaveMode) {
            SupportDataSaveMode saveModeSink = (SupportDataSaveMode) sink;
            saveModeSink.checkOptions(sinkConfig); // validate the configuration
        }
    }
}
```
7.2 Data Output
Execution flow:

```java
public List<Dataset<Row>> execute(List<Dataset<Row>> upstreamDataStreams) {
    Dataset<Row> input = upstreamDataStreams.get(0);
    for (int i = 0; i < plugins.size(); i++) {
        Config sinkConfig = pluginConfigs.get(i);
        SeaTunnelSink<?, ?, ?, ?> sink = plugins.get(i);
        // 1. Resolve the input stream
        Dataset<Row> dataset =
                fromSourceTable(sinkConfig, sparkRuntimeEnvironment).orElse(input);
        // 2. Set the type information
        sink.setTypeInfo((SeaTunnelRowType) TypeConverterUtils.convert(dataset.schema()));
        // 3. Handle the data save mode
        if (sink instanceof SupportDataSaveMode) {
            SupportDataSaveMode saveModeSink = (SupportDataSaveMode) sink;
            DataSaveMode saveMode = saveModeSink.getDataSaveMode();
            saveModeSink.handleSaveMode(saveMode);
        }
        // 4. Determine the parallelism
        int parallelism =
                sinkConfig.hasPath(CommonOptions.PARALLELISM.key())
                        ? sinkConfig.getInt(CommonOptions.PARALLELISM.key())
                        : sparkRuntimeEnvironment
                                .getSparkConf()
                                .getInt(
                                        CommonOptions.PARALLELISM.key(),
                                        CommonOptions.PARALLELISM.defaultValue());
        // 5. Set the parallelism (TODO: the current implementation needs rework)
        dataset.sparkSession().read().option(CommonOptions.PARALLELISM.key(), parallelism);
        // 6. Inject the sink logic
        SparkSinkInjector.inject(dataset.write(), sink)
                .option("checkpointLocation", "/tmp") // TODO: should be configurable
                .mode(SaveMode.Append)
                .save();
    }
    // Sink is the terminal stage, so there is no downstream data stream
    return null;
}
```
Sink input stream resolution:

```java
protected Optional<Dataset<Row>> fromSourceTable(
        Config pluginConfig, SparkRuntimeEnvironment env) {
    if (!pluginConfig.hasPath(SOURCE_TABLE_NAME)) {
        return Optional.empty();
    }
    String sourceTableName = pluginConfig.getString(SOURCE_TABLE_NAME);
    // Look up the dataset from the registered temp views
    return Optional.of(env.getSparkSession().read().table(sourceTableName));
}
```
SparkSinkInjector (pseudocode):

```java
class SparkSinkInjector {
    static DataFrameWriter<Row> inject(DataFrameWriter<Row> writer, SeaTunnelSink<?, ?, ?, ?> sink) {
        // Adapter-style bridge: point the writer at the "SeaTunnelSink" data source and
        // pass the serialized sink instance as an option so the Spark side can rebuild it
        return writer.format(SeaTunnelSink.class.getSimpleName())
                .option(Constants.SINK_SERIALIZATION, SerializationUtils.objectToString(sink));
    }
}
```
Execution Flow Overview

```
main() → SeaTunnel.run()
       → SparkTaskExecuteCommand.execute()
       → SparkExecution.execute()
       → SourceExecuteProcessor.execute()    // read via the "SeaTunnelSource" format, register temp views
       → TransformExecuteProcessor.execute() // mapPartitions + TransformIterator
       → SinkExecuteProcessor.execute()      // SparkSinkInjector.inject(...).save()
```
Key Design Takeaways

- Plugin-based architecture:
  - Source/Transform/Sink plugins are loaded dynamically through the SPI mechanism
  - Plugin discovery: the SeaTunnel*PluginDiscovery classes
  - Dependency isolation: registerPlugin() manages the plugin JARs
- Table-driven DAG:
  - source_table_name and result_table_name link the stages into an execution chain
  - Tables are registered via createOrReplaceTempView()
  - Tables are looked up via fromSourceTable()
- Type system conversion:
  - TypeConverterUtils handles bidirectional conversion between SeaTunnel and Spark types
  - SeaTunnelRowConverter performs row-level data conversion
  - Automatic schema derivation via source.getProducedType()
- Execution environment encapsulation:
  - SparkRuntimeEnvironment centrally manages the SparkSession
  - Supports both batch and streaming modes
  - Centralizes plugin dependencies and configuration
- Distributed transformation:
  - The transform stage uses mapPartitions for parallel processing
  - The custom TransformIterator encapsulates the transformation logic
  - Spark's ExpressionEncoder provides type safety
- Sink adaptation:
  - SparkSinkInjector bridges SeaTunnelSink and Spark's write path
  - Supports different save modes (Append/Overwrite, etc.)
  - Handles parallelism and checkpoint configuration automatically
This article has walked through the SeaTunnel Spark engine from job startup to DAG construction, highlighting the plugin loading mechanism, the table-driven DAG design, and the type conversion system, and should serve as a detailed reference for understanding how SeaTunnel works internally.