A small pitfall when converting the Kafka RDD to a Dataset while consuming Kafka data with Spark Streaming

This article walks through the direct-stream (direct connection) way of consuming Kafka from Spark Streaming.
1: My first idea was to convert the RDD to a Dataset directly inside foreachRDD.
The elements of the Kafka RDD, however, are of type ConsumerRecord[String, String]: besides the key, each record carries metadata such as the topic, partition and offset, and value() returns the actual data. A Dataset is essentially a table that carries schema information.
val value = rdd.map(rd => rd.value())
spark.createDataset(value) maps the data of this RDD onto such a table, so converting this way works. But val ds = spark.createDataset(rdd).map(r => r.value()), i.e. turning the Kafka RDD into a Dataset first and only then extracting the values, does not work: it keeps failing with a "missing implicit" error, presumably because Spark has no Encoder for ConsumerRecord, so the Dataset's row information cannot be matched against the KafkaRDD.
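A minimal sketch of the contrast (assuming spark is an existing SparkSession and records is the ConsumerRecord RDD obtained inside foreachRDD; both names are only illustrative):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import spark.implicits._   // brings the Encoder for String into scope

// records: RDD[ConsumerRecord[String, String]] coming from the direct stream

// works: taking value() first gives an RDD[String], and an Encoder[String] exists
val values: RDD[String] = records.map(_.value())
val ok: Dataset[String] = spark.createDataset(values)

// does not compile: there is no Encoder[ConsumerRecord[String, String]],
// so the "missing implicit" error is raised at createDataset itself
// val bad = spark.createDataset(records).map(_.value())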
2: There are two more ways to convert an RDD to a Dataset/DataFrame:
(1)
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd: org.apache.spark.rdd.RDD[Row] = null   // placeholder: an RDD of Row objects
val schema = StructType(Seq(
  StructField("textField", StringType, nullable = false)
))
val dataset = SparkSession.builder().getOrCreate().createDataFrame(rdd, schema)
(2) rdd.toDS() (this needs import spark.implicits._ in scope); a sketch of both variants follows.
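A minimal sketch of both variants, assuming a SparkSession named spark and using illustrative sample data:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// (1) explicit schema: an RDD[Row] plus a StructType passed to createDataFrame
val rowRdd = spark.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
val schema = StructType(Seq(StructField("textField", StringType, nullable = false)))
val df = spark.createDataFrame(rowRdd, schema)

// (2) toDS(): works directly on an RDD whose element type has an Encoder,
//     e.g. RDD[String]; it requires import spark.implicits._
val ds = spark.sparkContext.parallelize(Seq("hello", "world")).toDS()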

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = ...
val ssc = new StreamingContext(conf, Seconds(2)) // the StreamingContext can be built from either a SparkConf or a SparkContext
val stream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Params.topic, Params.kafkaParams))

val spark = SparkSession
  .builder().config(conf).getOrCreate() // note: reuse the conf defined above
import spark.implicits._ // required whenever DataFrames or Datasets are used

stream.foreachRDD(rdd => {
  // to turn this Kafka RDD into a Dataset we must first take the records' value
  val value = rdd.map(rd => rd.value())
  val ds: Dataset[String] = spark.createDataset(value)
  ds.foreachPartition(iter => {
    iter.foreach(rdddata => {
      println(rdddata)
    })
  })
})
ssc.start()
ssc.awaitTermination()
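As a follow-up, the Dataset built inside foreachRDD can also be given a column name and handled with DataFrame operations; a minimal sketch (same stream and spark as above, the column name "line" is only illustrative):

stream.foreachRDD(rdd => {
  val values = rdd.map(_.value())
  val df = spark.createDataset(values).toDF("line")   // single String column named "line"
  df.show(10, truncate = false)                       // inspect a sample of each micro-batch
})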


Reposted from blog.csdn.net/qq_37923600/article/details/87865967