Spark: print each split and its corresponding file

Spark's HadoopRDD exposes the Hadoop InputSplit behind every partition through mapPartitionsWithInputSplit, so casting the split to a FileSplit lets us print which file each split reads:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// Reading via the old mapred API yields a HadoopRDD, which exposes each partition's InputSplit.
val rdd = sc.hadoopFile("hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
val hadoopRdd = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]

// For each partition, cast its InputSplit to a FileSplit and emit the path of the file it covers.
hadoopRdd.mapPartitionsWithInputSplit((inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
  val file = inputSplit.asInstanceOf[FileSplit]
  Seq("split,file:" + file.getPath.toString).iterator
}).collect().foreach(println)

The output is as follows:

split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000000_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000001_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000002_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000003_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000004_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000005_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000006_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000007_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000008_0
split,file:hdfs://namenode:9000/home/hdp-ads-audit/dubhe_data/hive//tmp/test/dt=2017-03-15/000009_0
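If you also want to see which byte range of the file each split covers, the same pattern works, since org.apache.hadoop.mapred.FileSplit additionally exposes getStart and getLength. A minimal sketch (the HDFS path below is a placeholder; substitute your own directory):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// Placeholder input path for illustration only.
val rdd2 = sc.hadoopFile("hdfs://namenode:9000/path/to/data",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)

rdd2.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((split: InputSplit, iter: Iterator[(LongWritable, Text)]) => {
    // Each split reports its backing file plus the byte offset and length it covers.
    val fs = split.asInstanceOf[FileSplit]
    Iterator("split,file:" + fs.getPath + ",start:" + fs.getStart + ",length:" + fs.getLength)
  })
  .collect()
  .foreach(println)

For large files this makes it visible that a single file can be broken into several splits, each covering a different byte range.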


Reposted from blog.csdn.net/wisgood/article/details/78153839