Spark Source Code Analysis: Reading Files

There are two paths: reading through SparkContext, and reading through FileSourceScanExec (Spark SQL uses the latter).

1. Reading files via SparkContext (sc.textFile as the example)

  • Logical input splits are computed first; one RDD partition is then created per split.
  • Call chain: SparkContext.textFile -> SparkContext.hadoopFile -> HadoopRDD.getPartitions -> TextInputFormat.getSplits -> FileInputFormat.getSplits
  • Split-size formula: Math.max(minSize (1 B), Math.min(totalSize / minPartitions, blockSize (128 MB)))
  • Example: max(1 B, min(100 MB / 2, 128 MB)) = 50 MB
  • Range: 1 B ≤ actual split size ≤ 128 MB
  • minPartitions = min(defaultParallelism, 2)
  • defaultParallelism comes from spark.default.parallelism; if that is unset, the default depends on the deploy mode:
  • local mode: number of CPU cores
  • cluster mode (standalone/YARN): max(executor-cores * num-executors, 2)
  • Mesos fine-grained mode: 8
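The split-size computation above can be sketched as follows. This is a minimal Python model of the formula used by Hadoop's FileInputFormat.getSplits, not the actual implementation (the real code also tolerates a ~10% overshoot on the last split); the constants mirror the defaults mentioned above:

```python
# Minimal model of the split-size formula behind sc.textFile.
# All sizes are in bytes.

MIN_SIZE = 1                    # minimum split size default (1 B)
BLOCK_SIZE = 128 * 1024 * 1024  # typical HDFS block size (128 MB)

def split_size(total_size: int, min_partitions: int,
               min_size: int = MIN_SIZE, block_size: int = BLOCK_SIZE) -> int:
    """splitSize = max(minSize, min(totalSize / minPartitions, blockSize))."""
    goal_size = total_size // min_partitions
    return max(min_size, min(goal_size, block_size))

def num_splits(total_size: int, min_partitions: int) -> int:
    """Number of splits (hence partitions) a file of total_size produces."""
    size = split_size(total_size, min_partitions)
    return -(-total_size // size)  # ceiling division

mb = 1024 * 1024
# The example from above: a 100 MB file with minPartitions = 2 -> 50 MB splits.
print(split_size(100 * mb, 2) // mb)  # 50
print(num_splits(100 * mb, 2))        # 2
```

Note how blockSize caps the split size: for a 1 GB file with minPartitions = 2, goalSize is 512 MB but each split is clamped to 128 MB, yielding 8 partitions rather than 2.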

2. Reading via FileSourceScanExec

Source code:

/**
 * Physical plan node for scanning data from HadoopFsRelations.
 *
 * @param relation The file-based relation to scan.
 * @param output Output attributes of the scan, including data attributes and partition attributes.
 * @param requiredSchema Required schema of the underlying relation, excluding partition columns.
 * @param partitionFilters Predicates to use for partition pruning.
 * @param dataFilters Filters on non-partition columns.
 * @param metastoreTableIdentifier identifier for the table in the metastore.
 */
case class FileSourceScanExec(
    @transient relation: HadoopFsRelation,
    output: Seq[Attribute],
    requiredSchema: StructType,
    partitionFilters: Seq[Expression],
    dataFilters: Seq[Expression],
    override val metastoreTableIdentifier: Option[TableIdentifier])
  extends DataSourceScanExec with ColumnarBatchScan  {

  /**
   * Create an RDD for non-bucketed reads.
   * The bucketed variant of this function is [[createBucketedReadRDD]].
   *
   * @param readFile a function to read each (part of a) file.
   * @param selectedPartitions Hive-style partitions that are part of the read.
   * @param fsRelation [[HadoopFsRelation]] associated with the read.
   */
  private def createNonBucketedReadRDD(
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Seq[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
    val defaultMaxSplitBytes =
      fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism

    // Upper bound on the bytes packed into a single read partition.
    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

    // ... the method then splits each splittable file into chunks of at most
    // maxSplitBytes and bin-packs the chunks into FilePartitions (omitted here)
  }
}
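In Spark's implementation, this method computes an upper bound maxSplitBytes = min(spark.sql.files.maxPartitionBytes, max(spark.sql.files.openCostInBytes, totalBytes / defaultParallelism)), splits each splittable file at that boundary, and greedily bin-packs the resulting chunks (largest first) into read partitions, charging the open cost for every chunk. A rough Python model follows; it is an illustrative sketch, not Spark's actual code, and the 128 MB / 4 MB constants are the documented defaults of those two settings:

```python
# Rough model of how FileSourceScanExec sizes and packs non-bucketed read
# partitions. All sizes in bytes.

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes default
OPEN_COST_IN_BYTES = 4 * 1024 * 1024     # spark.sql.files.openCostInBytes default

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=MAX_PARTITION_BYTES,
                    open_cost=OPEN_COST_IN_BYTES):
    """maxSplitBytes = min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore))."""
    total_bytes = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def pack_partitions(file_sizes, default_parallelism):
    """Split files at maxSplitBytes boundaries, then greedily bin-pack the
    chunks (largest first) into partitions, charging open_cost per chunk."""
    limit = max_split_bytes(file_sizes, default_parallelism)
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(limit, size - offset))
            offset += limit
    chunks.sort(reverse=True)
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > limit:
            partitions.append(current)       # close the current partition
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk + OPEN_COST_IN_BYTES
    if current:
        partitions.append(current)
    return partitions

mb = 1024 * 1024
# Twenty 2 MB files with defaultParallelism = 4: the per-core target lifts the
# limit to 30 MB, so several small files are packed into each partition.
print(len(pack_partitions([2 * mb] * 20, 4)))  # 4
```

The open cost is why many tiny files do not collapse into a single huge partition: each chunk is billed an extra 4 MB, which models the fixed cost of opening a file.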