Pitfalls of integrating Spark Streaming with Kafka in spark-on-yarn mode (solved) - fqzzzzz's blog

This post describes the errors encountered while consuming Kafka data with Spark Streaming on Windows and submitting the job to a YARN cluster, together with the fixes. The errors include a jar that could not be found and checkpoint recovery failures. The fixes include setting the Spark master to 'yarn', changing the checkpoint path, making sure all dependencies are present, and adjusting the Spark configuration. The post also covers why setJars should not be kept when packaging, and how to stop the spark-submit process cleanly.


Project scenario:

Use Spark Streaming to receive data from Kafka and do the computation, then package the job, upload it to Linux, and submit it there as a Spark job.


Errors encountered:

1. Error 1:

Failed to add file:/usr/local/spark-yarn/./myapp/sparkDemo04.jar to Spark environment
java.io.FileNotFoundException: Jar D:\usr\local\spark-yarn\myapp\sparkDemo04.jar not found
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

2. Running the code in yarn mode from IDEA on Windows fails with the following error:

WARN CheckpointReader: Error reading checkpoint from file hdfs://hadoop102:9000/checkpoint6/checkpoint-1637834226000
java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.MappedDStream.mapFunc of type scala.Function1 in instance of org.apache.spark.streaming.dstream.MappedDStream

3. Various Kafka class-not-found errors. These are not discussed further below: just download the corresponding jars and put them into the jars directory of the Spark installation. A common one is:

java.lang.NoClassDefFoundError: org/apache/spark/kafka010/KafkaConfigUpdater

All of these can be fixed by adding the missing jars. If the error shows up inside spark-shell, append --jars <jar path> to the spark-shell command.
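For reference, the missing classes come from the Kafka connector artifacts. Below is a minimal dependency sketch, assuming an sbt build and Spark 3.x compiled for Scala 2.12 (the versions are assumptions, so match them to the cluster; the Maven coordinates are the same if, like this project, you build with mvn package):

// build.sbt (sketch; adjust sparkVersion to the Spark installed on the cluster)
val sparkVersion = "3.0.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // spark-streaming-kafka-0-10 pulls in spark-token-provider-kafka-0-10 and
  // kafka-clients transitively; these are exactly the jars that otherwise have
  // to be copied into the jars directory of the Spark installation or passed via --jars
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)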
The original code:

import com.study.stream05_kafka.SparkKafka.createSSC
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.lang.System.getProperty
import scala.collection.mutable.ListBuffer

object stream05_kafka {

  object SparkKafka{
    def createSSC(): StreamingContext = {
      //    TODO: create the environment object
      //    When creating the StreamingContext, the first argument is the environment config, the second is the batch (data collection) interval
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("kafka2")
      sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true")
      sparkConf.set("spark.hadoop.fs.defaultFS","hdfs://hadoop102:9000")
      sparkConf.set("spark.hadoop.yarn.resoursemanager.address","hadoop103:8088")
      val streamingContext: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
      streamingContext.checkpoint("hdfs://hadoop102:9000/checkpoint6")
      val kafkaPara: Map[String, Object] = Map[String, Object](
        ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
        ConsumerConfig.GROUP_ID_CONFIG -> "second",
        "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
        "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
      )
      //    TODO: processing logic
      val kafkaDS: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
        streamingContext,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Set("sparkOnKafka"), kafkaPara)
      )

      val num: DStream[String] = kafkaDS.map(_.value())
      val result = num.map(
        line=>{
          val flows = line.split(",")
          val up=flows(1).toInt
          val down=flows(2).toInt
          (flows(0),(up,down,up+down))
        }
      ).updateStateByKey(
        (queueValue, buffValue: Option[(Int,Int,Int)]) => {
          val cur=buffValue.getOrElse((0,0,0))
          var curUp=cur._1
          var curDown=cur._2
          for (elem <- queueValue) {
            curUp+=elem._1
            curDown+=elem._2
          }
          Option((curUp,curDown,curUp+curDown))
        }
      )
      result.print()
      streamingContext
    }
  }
  def main(args: Array[String]): Unit = {
    println("**************")
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    System.getProperties.setProperty("HADOOP_USER_NAME", "hadoop")
    val streamingContext = StreamingContext.getActiveOrCreate("hdfs://hadoop102:9000/checkpoint6", ()=>createSSC())
    streamingContext.start()
    //    2. wait for termination
    streamingContext.awaitTermination()

  }

}

Cause analysis:

First of all: if the job is to be packaged and submitted on Linux in yarn mode, the master must be set to yarn; whether it runs as yarn client or yarn cluster is chosen at submit time via --deploy-mode (client is the default). I had written local here, which is why everything worked on Windows (connecting to Kafka and computing) but failed on Linux: the job was never actually talking to YARN.
1. Error 1 happens because spark-submit on Windows cannot find your jar. Think about the spark-submit command: it needs the jar and the main class to be specified.
2. This one is either a serialization problem or a case of broadcast variables not surviving checkpointing; from what I found, broadcast-variable contents written to HDFS are hard to restore. The error can be pinpointed to StreamingContext.getActiveOrCreate: sometimes recovery works, sometimes it throws. I have not found a real fix, so I simply switched to a new checkpoint path. In production a streaming job is usually only stopped for a code upgrade anyway, so I did not dig deeper; an explanation from someone more knowledgeable would be welcome. My guess is that deserialization fails when the checkpoint is read back, i.e. the function objects serialized into the checkpoint no longer match the recompiled classes.

Solution:

Fix for error 1: to run on Windows, first package the program with mvn package or Build Artifacts, then pass the jar path to sparkConf.setJars; after that it runs fine on Windows (a sketch follows below).
Fix for error 2: I simply switched to a new checkpoint path.
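A minimal sketch of the fix for error 1 when running locally on Windows (the jar path below is just the one from the error message above; substitute the actual output of your build):

import org.apache.spark.SparkConf

val localConf = new SparkConf()
  .setMaster("local[*]")   // local run inside IDEA on Windows
  .setAppName("kafka2")
  // point Spark at the jar produced by mvn package / Build Artifacts,
  // so the executors can load the job's classes
  .setJars(Seq("D:/usr/local/spark-yarn/myapp/sparkDemo04.jar"))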
Finally, here is the code that eventually ran successfully:

import com.study.stream05_kafka.SparkKafka.createSSC
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.lang.System.getProperty
import scala.collection.mutable.ListBuffer

object stream05_kafka {

  object SparkKafka{
    def createSSC(): StreamingContext = {
      //    TODO: create the environment object
      //    When creating the StreamingContext, the first argument is the environment config, the second is the batch (data collection) interval
      val sparkConf = new SparkConf().setMaster("yarn").setAppName("kafka2").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      sparkConf.set("spark.streaming.stopGracefullyOnShutdown","true")
      sparkConf.set("spark.hadoop.fs.defaultFS","hdfs://hadoop102:9000")
      sparkConf.set("spark.hadoop.yarn.resoursemanager.address","hadoop103:8088")
      val streamingContext: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
      streamingContext.checkpoint("hdfs://hadoop102:9000/checkpoint7")
      val kafkaPara: Map[String, Object] = Map[String, Object](
        ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
        ConsumerConfig.GROUP_ID_CONFIG -> "second",
        "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
        "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
      )
      //    TODO: processing logic
      val kafkaDS: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
        streamingContext,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Set("sparkOnKafka"), kafkaPara)
      )

      val num: DStream[String] = kafkaDS.map(_.value())
      val result = num.map(
        line=>{
          val flows = line.split(",")
          val up=flows(1).toInt
          val down=flows(2).toInt
          (flows(0),(up,down,up+down))
        }
      ).updateStateByKey(
        (queueValue, buffValue: Option[(Int,Int,Int)]) => {
          val cur=buffValue.getOrElse((0,0,0))
          var curUp=cur._1
          var curDown=cur._2
          for (elem <- queueValue) {
            curUp+=elem._1
            curDown+=elem._2
          }
          Option((curUp,curDown,curUp+curDown))
        }
      )
      result.print()
      streamingContext
    }
  }
  def main(args: Array[String]): Unit = {
    println("**************")
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    System.getProperties.setProperty("HADOOP_USER_NAME", "hadoop")
    val streamingContext = StreamingContext.getActiveOrCreate("hdfs://hadoop102:9000/checkpoint7", ()=>createSSC())
//    new Thread(new MonitorStop(streamingContext)).start()
    streamingContext.start()
    //    2. wait for termination
    streamingContext.awaitTermination()

  }

}

Also, do not keep setJars in the code when you package the jar for spark-submit, otherwise it errors out again. I forget the exact message (this post was written after the problem was already solved, so I did not record every error), but if I remember correctly it was something like:

cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
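Since error 1 needs setJars for local runs on Windows while the packaged jar must not contain it, one way to keep a single source file (my own sketch, not from the original code; the run.local flag is hypothetical) is to call setJars only when a local switch is set:

import org.apache.spark.SparkConf

// e.g. pass -Drun.local=true in the IDEA run configuration; absent on the cluster
val runLocal = sys.props.get("run.local").contains("true")
val conf = new SparkConf().setAppName("kafka2")
if (runLocal) {
  conf.setMaster("local[*]")
      .setJars(Seq("D:/usr/local/spark-yarn/myapp/sparkDemo04.jar"))
} else {
  conf.setMaster("yarn")   // client vs cluster is still decided by spark-submit --deploy-mode
}

This way the jar that goes through spark-submit never carries a hard-coded local path, which should avoid hitting the error above.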

Remaining confusion:

To fix this bug I spent a whole day going back and forth between the YARN logs and the Spark logs. The most frustrating part: after quitting spark-submit with Ctrl+Z, the spark-submit process keeps running in the background (Ctrl+Z only suspends the foreground process, it does not terminate it). I even suspect that my kill -9 corrupted the checkpoint and caused the recovery failure. Does anyone know the proper way to end the spark-submit process?
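For what it is worth, one common pattern, and presumably what the commented-out MonitorStop line in the final code refers to, is to poll an HDFS flag file from a background thread and stop the StreamingContext gracefully instead of killing the process. The sketch below is my own reconstruction; the flag path /stopSpark and the polling interval are assumptions:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.{StreamingContext, StreamingContextState}

class MonitorStop(ssc: StreamingContext) extends Runnable {
  override def run(): Unit = {
    // same HDFS URI and user as the rest of this post
    val fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), new Configuration(), "hadoop")
    while (true) {
      Thread.sleep(5000)
      // create /stopSpark in HDFS when the job should shut down
      val stopRequested = fs.exists(new Path("hdfs://hadoop102:9000/stopSpark"))
      if (stopRequested && ssc.getState() == StreamingContextState.ACTIVE) {
        // let the in-flight batches finish, then stop everything;
        // pairs with the spark.streaming.stopGracefullyOnShutdown setting above
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        System.exit(0)
      }
    }
  }
}

Start it with new Thread(new MonitorStop(streamingContext)).start() before streamingContext.start(); creating the flag file (hdfs dfs -touchz /stopSpark) then triggers a clean stop. Alternatively, yarn application -kill <applicationId> at least lets YARN tear the application down instead of killing the client process directly.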
