Too many small files when importing files into Hive with Spark

shdqiu

Published 2018-09-04 17:38:57

License: CC 4.0 BY-SA

Categories: spark, hive

Article link: https://blog.csdn.net/qq_39160721/article/details/82387328

Environment:

ambari 2.6.1

spark 2.1

python 3.6

oracle 11.2

sqoop 1.4

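Files that sqoop had collected into HDFS were imported into a Hive database. The import succeeded, but the Hive table ended up containing many small files, which seriously slows down data loading during later analysis.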

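Solution: repartition the DataFrame down to a single partition before the insert, so that one task writes the output and each Hive partition receives one file instead of one file per task.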


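# `spark` (a SparkSession with Hive support) and `tablename` are defined earlier in the job.
SJTable = spark.sql("select * from " + tablename + "_tmp where att = '1E'")
datanum = SJTable.count()

# Fix for the small files: collapse to one partition so a single task writes the output.
SJTable_tmp = SJTable.repartition(1).persist()
SJTable_tmp.createOrReplaceTempView(tablename + "_cpu_tmp")

# Dynamic-partition insert; the last two select columns feed PARTITION(area,timdate).
spark.sql(
    "insert into table " + tablename + "_cpusj PARTITION(area,timdate) "
    "select lcn,pid,tim,tf,fee,bal,epid,etim,card_type,service_code,is_area_code,use_area_code,"
    "clea_day,CURRENT_TIMESTAMP,use_area_code as area,substr(tim,1,6) as timdate "
    "from " + tablename + "_cpu_tmp"
)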
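To see why repartition(1) cuts down the file count, here is a rough plain-Python model (not Spark code). It assumes that every write task emits one file for each Hive partition it holds rows for. The function name count_output_files, the sample rows, and the task count of 200 are all made up for illustration and do not come from the original job.

```python
from collections import defaultdict


def count_output_files(rows, num_tasks):
    """Model the number of files written by a dynamic-partition insert.

    rows: list of (area, timdate) tuples, one per record (hypothetical data).
    num_tasks: number of Spark partitions at write time. Rows are dealt out
    round-robin, which only approximates how Spark spreads rows over tasks.
    """
    hive_partitions_per_task = defaultdict(set)
    for i, hive_partition in enumerate(rows):
        hive_partitions_per_task[i % num_tasks].add(hive_partition)
    # Each task writes one file per distinct Hive partition it has seen.
    return sum(len(parts) for parts in hive_partitions_per_task.values())


# Hypothetical data: 3 areas x 2 months, 100 records each.
sample_rows = [
    (area, month)
    for area in ("A1", "A2", "A3")
    for month in ("201808", "201809")
    for _ in range(100)
]

files_many_tasks = count_output_files(sample_rows, num_tasks=200)  # arbitrary task count
files_single_task = count_output_files(sample_rows, num_tasks=1)   # after repartition(1)
print(files_many_tasks, files_single_task)
```

With this sample data the model gives 600 files for 200 tasks and 6 files for a single task, that is, one file per (area, timdate) combination. The trade-off is that a single task writes everything, so the insert itself no longer runs in parallel; that is probably acceptable for modest data volumes but may be slow for large ones.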

Files after the change: