Writing Hive data into TiDB with Spark SQL [2]

This post describes an efficient write strategy using Apache Spark: 15 million rows loaded in 13 minutes, a significant performance improvement over the first version. It focuses on how to configure the JDBC connection and batching parameters to speed up loading into TiDB.

15 million rows written in 13 minutes, an n-fold efficiency gain over the first version.

package com.xxx.warehouse.service


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

object Test_tidb2 {

 def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder
      .appName("Test_tidb2")
      //.master("local[*]")
      .config("spark.driver.allowMultipleContexts", true)
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .enableHiveSupport()
      .getOrCreate()

    //spark.sparkContext.setLogLevel("ERROR")

   val customer = spark.sql("select user_id,story_id,voice_type from ads.ads_ks_tips_rurn_pay_user_listen_story_finish_a_d where user_part = '1'")
   // Repartition the source DataFrame to balance the load across executors
   // and increase write concurrency (one JDBC connection per partition).
   val df = customer.repartition(32)

   df.write
     .mode("append")
     .format("jdbc")
     .option("driver", "com.mysql.jdbc.Driver")
     // Replace the host and port with your own, and be sure to enable
     // rewriteBatchedStatements so the driver combines inserts into multi-row batches.
     .option("url", "jdbc:mysql://tidb.dev.xxx.com:4000/ks_content_assets?rewriteBatchedStatements=true")
     .option("useSSL", "false")
     // In our tests, a batch size of 150 rows per insert worked well.
     .option(JDBCOptions.JDBC_BATCH_INSERT_SIZE, 150)
     .option("dbtable", s"wxt_test") // database name and table name here
     .option("isolationLevel", "NONE") // recommended to set isolationLevel to NONE if you have a large DF to load.
     .option("user", "root") // TiDB user here
     .option("password", "")
     .save()

   spark.stop()
  }
}
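
To sanity-check the load, you can read the table back over the same JDBC connection and compare its row count with the source DataFrame before calling spark.stop(). A minimal sketch, assuming the same URL, credentials, and wxt_test table as in the job above:

    val written = spark.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://tidb.dev.xxx.com:4000/ks_content_assets")
      .option("dbtable", "wxt_test")
      .option("user", "root")
      .option("password", "")
      .load()
    // Compare against the source partition that was just written.
    println(s"TiDB rows: ${written.count()}, source rows: ${df.count()}")

On the design: with repartition(32), up to 32 tasks write in parallel, each over its own JDBC connection, while the batch size of 150 bounds the number of rows in each multi-row INSERT that rewriteBatchedStatements produces.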