Spark/Scala - Read RcFile && OrcFile

1. Introduction

As mentioned in the earlier posts MapReduce - Reading OrcFile and RcFile and MapReduce - Reading RcFile and OrcFile at the Same Time, reading RcFile and OrcFile was implemented there with Java + MapReduce, and the dependency conflict that came up when reading both at once was also resolved. In everyday development, however, I am more used to Spark, so here I use Spark to read OrcFile and RcFile and to implement the Map-Reduce logic.

2. Read RcFile

We already saw the layout of RcFile in the earlier MR task. The exact key type is not critical: LongWritable carries the row number, while NullWritable simply ignores it. What matters is the value type, BytesRefArrayWritable. Spark's hadoopFile API can therefore be used to read an RcFile. First, look at the parameters of hadoopFile:

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)]

The keyClass and valueClass are determined above, and so is the input format: as shown in MR - MultipleInputs.addInputPath, the inputFormatClass is RCFileInputFormat. The read then goes like this:

    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = (new SparkConf).setAppName("TestReadRcFile").setMaster("local[1]")

    val spark = SparkSession
      .builder
      .config(conf)
      .getOrCreate()

    val sc = spark.sparkContext

    val rcFileInput = ""

    val minPartitions = 100

    println(repeatString("=", 30) + "Start reading RcFile" + repeatString("=", 30))

    val rcFileRdd = sc.hadoopFile(
        rcFileInput,
        classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
        classOf[LongWritable],
        classOf[BytesRefArrayWritable],
        minPartitions)
      .map(line => {
        // Decode the first two columns of each row with the readString helper.
        val key = LazyBinaryRCFileUtils.readString(line._2.get(0))
        val value = LazyBinaryRCFileUtils.readString(line._2.get(1))
        (key, value)
      })

    println(repeatString("=", 30) + "Finished reading RcFile" + repeatString("=", 30))
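For reference, readString only needs to decode a single column slice. Below is a minimal sketch of such a helper, assuming the column bytes are UTF-8 text; the actual LazyBinaryRCFileUtils helper from the earlier MapReduce post may differ:

    import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable
    import org.apache.hadoop.io.Text

    object SimpleRcFileDecoder {
      // A column of an RcFile row arrives as a BytesRefWritable slice
      // (data, start, length); decode it here as UTF-8 text.
      def readString(ref: BytesRefWritable): String =
        Text.decode(ref.getData, ref.getStart, ref.getLength)
    }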

 

3. Read OrcFile

Compared with reading RcFile, there is no need for hadoopFile here, because SparkSession provides an API that reads ORC files directly, which makes reading OrcFile in Spark quite smooth. Note that the Dataset returned by the ORC read needs to be converted to a regular Spark RDD via .rdd.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = (new SparkConf).setAppName("TestReadOrcFile").setMaster("local[1]")

    val spark = SparkSession
      .builder
      .config(conf)
      .getOrCreate()

    println(repeatString("=", 30) + "Start reading OrcFile" + repeatString("=", 30))

    import spark.implicits._
    val orcInput = ""
    val orcFileRdd = spark.read.orc(orcInput).map(row => {
        val key = row.getString(0)
        val value = row.getString(1)
        (key, value)
      }).rdd

    println(repeatString("=", 30) + "Finished reading OrcFile" + repeatString("=", 30))
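If the ORC schema has named columns, the same read can also pull fields by name instead of by position. A small sketch, assuming hypothetical column names key and value (adjust to the real schema):

    // Hypothetical column names; replace with the actual ORC schema fields.
    val orcByNameRdd = spark.read.orc(orcInput)
      .select("key", "value")
      .map(row => (row.getAs[String]("key"), row.getAs[String]("value")))
      .rdd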

 

4. Spark implements Map-Reduce

The two PairRDDs above, rcFileRdd and orcFileRdd, can be regarded as two Mappers. The reduce side follows: the PairRDDs are merged with union, then groupByKey aggregates the values for each key, after which the reduce operation can be applied:

    rcFileRdd.union(orcFileRdd).groupByKey().map(info => {
      val key: String = info._1
      val values: Iterable[String] = info._2
      // ... reduce func ...
    })
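As a concrete stand-in for the elided reduce function, here is a hypothetical reducer that just counts the values per key and joins them with commas; the real aggregation depends on the job:

    // Hypothetical reduce step: emit (key, value count, values joined by ',').
    val reducedRdd = rcFileRdd.union(orcFileRdd)
      .groupByKey()
      .map { case (key, values) =>
        (key, values.size, values.mkString(","))
      }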

5. Summary

Compared with reading RcFile and OrcFile in MR, Spark's API is much simpler. Friends who need it can give it a try. Nice~


Origin blog.csdn.net/BIT_666/article/details/124311979