There are two ways Spark reads files: through SparkContext, or through FileSourceScanExec (this is the path Spark SQL takes).
1. Reading files via SparkContext (sc.textFile as an example)
- Logical splits are computed first; the RDD's partitions are then derived from those splits.
- Call chain: SparkContext.textFile -> SparkContext.hadoopFile -> HadoopRDD.getPartitions -> TextInputFormat.getSplits -> FileInputFormat.getSplits
- Split-size formula: splitSize = Math.max(minSize (1 B), Math.min(totalSize / minPartitions, blockSize (128 MB))) (see the sketch after this list)
  - Example: max(1 B, min(100 MB / 2, 128 MB)) = 50 MB
  - Range: 1 B ≤ actual split size ≤ 128 MB
- minPartitions = min(defaultParallelism, 2)
- defaultParallelism takes the value of spark.default.parallelism; if it is not set, the default is:
  - Local mode: the number of CPU cores
  - Cluster mode: max(executor-cores * num-executors, 2)
  - Mesos fine-grained mode: 8
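To make the formula above concrete, here is a minimal sketch of the split-size computation that FileInputFormat.getSplits performs for sc.textFile. The object and method names, the 100 MB example file, and the default constants are illustrative assumptions; the real implementation also honors the configured minimum split size and each file's actual block size.

```scala
// Minimal sketch of splitSize = max(minSize, min(goalSize, blockSize)),
// where goalSize = totalSize / minPartitions.
// All names and constants here are illustrative, not the actual Hadoop code.
object SplitSizeSketch {

  def computeSplitSize(
      totalSize: Long,
      minPartitions: Int,
      blockSize: Long = 128L * 1024 * 1024, // assumed 128 MB HDFS block size
      minSize: Long = 1L): Long = {         // assumed 1 B minimum split size
    val goalSize = totalSize / math.max(minPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    val totalSize = 100L * 1024 * 1024 // a single 100 MB file
    val minPartitions = 2              // sc.defaultMinPartitions = min(defaultParallelism, 2)
    val splitSize = computeSplitSize(totalSize, minPartitions)
    println(s"splitSize = ${splitSize / 1024 / 1024} MB") // 50 MB
  }
}
```

With a 100 MB file and minPartitions = 2, the goal size is 50 MB, which is below the 128 MB block size, so the file is cut into two ~50 MB splits and the resulting RDD has two partitions, matching the example above.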
2. Reading files via FileSourceScanExec
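Spark SQL does not go through the Hadoop split formula above; FileSourceScanExec sizes its file partitions from spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. Below is a minimal sketch of that sizing logic as applied in createNonBucketedReadRDD; the object and method names and the example numbers are illustrative assumptions, and the real code also adds openCostInBytes to each file's length when summing totalBytes.

```scala
// Minimal sketch of the partition sizing used for non-bucketed Spark SQL reads:
// maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, totalBytes / defaultParallelism))
// The config keys in the comments are real Spark SQL options; everything else is illustrative.
object MaxSplitBytesSketch {

  def maxSplitBytes(
      totalBytes: Long,
      defaultParallelism: Int,
      maxPartitionBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes default
      openCostInBytes: Long = 4L * 1024 * 1024): Long = { // spark.sql.files.openCostInBytes default
    val bytesPerCore = totalBytes / math.max(defaultParallelism, 1)
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // e.g. 10 GB of input on 8 cores: bytesPerCore ≈ 1.25 GB, capped at 128 MB per partition
    val bytes = maxSplitBytes(10L * 1024 * 1024 * 1024, defaultParallelism = 8)
    println(s"maxSplitBytes = ${bytes / 1024 / 1024} MB") // 128 MB
  }
}
```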
Source code:
/**
 * Physical plan node for scanning data from HadoopFsRelations.
 *
 * @param relation The file-based relation to scan.
 * @param output Output attributes of the scan, including data attributes and partition attributes.
 * @param requiredSchema Required schema of the underlying relation, excluding partition columns.
 * @param partitionFilters Predicates to use for partition pruning.
 * @param dataFilters Filters on non-partition columns.
 * @param metastoreTableIdentifier identifier for the table in the metastore.
 */
case class FileSourceScanExec(
    @transient relation: HadoopFsRelation,
    output: Seq[Attribute],
    requiredSchema: StructType,
    partitionFilters: Seq[Expression],
    dataFilters: Seq[Expression],
    override val metastoreTableIdentifier: Option[TableIdentifier])
  extends DataSourceScanExec with ColumnarBatchScan {

  /**
   * Create an RDD for non-bucketed reads.
   * The bucketed variant of this function is [[createBucketedReadRDD]].
   *
   * @param readFile a function to read each (part of a) file.
   * @param selectedPartitions Hive-style partition that are part of the read.
   * @param fsRelation [[HadoopFsRelation]] associated with the read.
   */
  private def createNonBucketedReadRDD(
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
val